# Gathering data


- Gathering data is the first step in data wrangling
- It varies from project to project. Sometimes you are given data or pointed to it. Sometimes you need to search for the right data for your project
- Sometimes the data you need isn't readily available and you need to generate it yourself somehow

## Dataset: Finding the best movies?
#### How to find the best movies
- You may check movie rating websites like Rotten Tomatoes or IMDb to help you choose. These sites contain a number of different metrics which are used to evaluate whether or not you will like a movie.
- However, because the metrics don't all appear on the same page, figuring out the best movie can get confusing

### Use a scatter plot
- We can start with [Rotten Tomatoes: Top 100 Movies of All Time](https://www.rottentomatoes.com/browse/movies_at_home/critics:certified_fresh?page=1)
- We can use a scatter plot to look at audience scores vs critic scores for each movie

- If we put audience scores on the horizontal axis and critic scores on the vertical axis, movies in the top right quadrant are amazing movies with high audience and critic scores
- Movies in the bottom right quadrant are critically underrated with high audience scores and low critic scores
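
The quadrant reading described above can be sketched in plain Python. This is a minimal illustration only: the function name `classify_quadrant`, the midpoint of 50, and the sample scores are all assumptions made for the example, not values from the Rotten Tomatoes data.

```python
def classify_quadrant(audience_score, critic_score, threshold=50):
    """Return the scatter-plot quadrant for one movie.

    audience_score is the horizontal axis, critic_score the vertical axis.
    The threshold of 50 is a hypothetical midpoint, not a value from the source.
    """
    high_audience = audience_score >= threshold
    high_critic = critic_score >= threshold
    if high_audience and high_critic:
        return "top right"      # loved by audiences and critics
    if high_audience:
        return "bottom right"   # high audience score, low critic score
    if high_critic:
        return "top left"       # high critic score, low audience score
    return "bottom left"        # low on both axes


# Made-up (audience, critic) scores for illustration only
sample_movies = {"Movie A": (95, 98), "Movie B": (90, 30)}
quadrants = {
    title: classify_quadrant(audience, critic)
    for title, (audience, critic) in sample_movies.items()
}
print(quadrants)
```

A plotting library would draw the same points; the function only makes the quadrant logic explicit.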

### Generate a Word Cloud
- For lots of people, Roger Ebert's movie review was the only review they needed because he explained the movie in such a way that they would know whether they would like it or not
- Wouldn't it be neat if we had a word cloud for each of the movies in the top 100 list at [RogerEbert.com](https://www.rogerebert.com/)?
- We can use Andreas Mueller's [Word Cloud Generator in Python](https://amueller.github.io/word_cloud/) to help.
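
A word cloud sizes each word by how often it appears, so the underlying computation is a word count. The sketch below shows only that counting step using the standard library; the stop-word list, the helper name `word_frequencies`, and the sample review text are assumptions for illustration, and it is not the word cloud package's own API.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "it", "to", "in"}


def word_frequencies(text):
    """Count how often each non-stop word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


# Made-up review text for illustration only
review_text = "A dazzling film. The film is a dazzling, moving triumph."
frequencies = word_frequencies(review_text)
print(frequencies.most_common(2))
```

A frequency mapping like this is the kind of information a word cloud generator uses to decide how large to draw each word.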

- The data needed for these two visualizations lives in different places, so gathering it all will take some craftiness

## Using a Pre-Gathered Dataset
- The dataset has four columns: Rank, Title, Rating, and Number of Reviews
- The file is in **TSV** (tab-separated values) format

### Flat File Structure
- Flat files contain tabular data in plain text format, with one data record per line and each record having one or more fields
- The fields are separated with delimiters like commas, tabs or colons

#### Advantages of flat files
- They are text files and therefore human readable
- Lightweight
- Simple to understand
- Great for small datasets

#### Disadvantages
- Lack of standards
- Data redundancy
- Sharing data can be cumbersome
- Not great for large datasets

### Flat Files in Python
- Because flat files are human readable, you can easily write your own code to parse these files using Python
- Open the file, read the text line by line, separate each line's content by the delimiter, and store everything in your data structure of choice
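
The steps above can be sketched as follows. This is a minimal hand-rolled parser, assuming the first line is a header row and that fields never contain the delimiter; the function name `parse_flat_file` and the two-row sample file are made up for the example.

```python
import os
import tempfile


def parse_flat_file(path, delimiter="\t"):
    """Parse a delimited text file into a list of dicts keyed by the header row."""
    records = []
    with open(path, encoding="utf-8") as file:
        # First line holds the column names
        header = file.readline().rstrip("\n").split(delimiter)
        for line in file:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            fields = line.split(delimiter)
            records.append(dict(zip(header, fields)))
    return records


# Tiny made-up TSV file so the sketch runs on its own
sample = "ranking\ttitle\n1\tMovie A\n2\tMovie B\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

records = parse_flat_file(tmp.name)
os.remove(tmp.name)
print(records)
```

Every value comes back as a string, and quoted fields are not handled; Python's built-in `csv` module or pandas covers those cases.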


###### Reading comma separated data
- We use the pandas `read_csv` function to bring CSV data into a DataFrame; for other delimiters, such as tabs, pass the `sep` argument

```python
import pandas as pd

# importing the Rotten Tomatoes bestofrt TSV file into a DataFrame
df = pd.read_csv("bestofrt.tsv", sep="\t")

# checking to see if the file was imported correctly
df.head()
```

| Unnamed: 0 | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |


## Source: Web Scraping
### Data isn't always easy to access
- We want to get Rotten Tomatoes audience scores and the number of audience reviews to add to our dataset
- Since it's not easily accessible from the website, we need to do **web scraping**, which allows us to extract data from websites using code

### How does web scraping work
- Website data is written in **HTML**, which uses tags to structure the page. Because HTML and its tags are just text, the text can be accessed using parsers
- We'll be using a Python parser called **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)**
- Start by exploring the structure of HTML files
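
To see how a parser walks through tags, here is a small sketch that pulls the text out of a `<title>` tag. It deliberately uses Python's built-in `html.parser` module instead of Beautiful Soup so that it runs without extra installs; the class name `TitleParser` and the HTML snippet are made up for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside the title tag
        if self.in_title:
            self.title += data


# Made-up HTML for illustration only
html = "<html><head><title>Example Movie</title></head><body><p>Hello</p></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)
```

Beautiful Soup hides this event-by-event bookkeeping behind a searchable tree, which is why it is more convenient for real pages.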

### Saving HTML
#### Accessing the HTML
- __Manual Access__
- The quick way to get HTML data is to save the HTML file to your computer manually. You can do this by clicking **Save** in your browser.

- __Programmatic Access__
- Programmatic access is preferred for scalability and reproducibility. Two options include:
1. Downloading HTML file programmatically

    ```python
    import requests
    url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
    response = requests.get(url)

    # Save the raw HTML; 'wb' is required because response.content is bytes
    with open('et_the_extraterrestrial.html', 'wb') as file:
        file.write(response.content)
    # This downloads one file; to download all 100, put this code in a loop
    ```
2. Working with the response content live in your computer's memory using the BeautifulSoup HTML parser (doesn't require saving a file to your computer at all)

    ```python
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(response.content, 'lxml')
    ```
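
The loop mentioned in option 1 might look like the sketch below. The helper names `fetch_with_requests` and `download_pages`, the 10-second timeout, and the rule of naming each file after the last part of its URL are assumptions for illustration. The demo passes a stand-in fetch function, so nothing is actually downloaded.

```python
import tempfile
from pathlib import Path


def fetch_with_requests(url):
    """Download one page and return its raw bytes (needs the requests package)."""
    import requests  # third-party; imported here so the demo below runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def download_pages(urls, folder, fetch=fetch_with_requests):
    """Save each URL's HTML into folder, one file per page."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical naming rule: use the last path segment of the URL
        filename = url.rstrip("/").split("/")[-1] + ".html"
        path = folder / filename
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved


# Demo with a stand-in fetch function so no network access is needed
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = download_pages(
        ["https://www.rottentomatoes.com/m/et_the_extraterrestrial"],
        tmp_dir,
        fetch=lambda url: b"<html></html>",
    )
    saved_names = [path.name for path in saved]
print(saved_names)
```

Leaving out the `fetch` argument uses `requests` for real downloads.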

### Accessing HTML in this lesson
- Open the pre-gathered HTML folder (rt_html), which contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time
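
Reading every file in such a folder could be sketched as follows. The helper name `load_html_folder` and the `.html` file extension are assumptions, and the demo builds a throwaway folder with made-up files instead of touching the real `rt_html` folder.

```python
import tempfile
from pathlib import Path


def load_html_folder(folder):
    """Map each .html file name in folder to its text content."""
    pages = {}
    for path in sorted(Path(folder).glob("*.html")):
        pages[path.name] = path.read_text(encoding="utf-8")
    return pages


# Demo on a temporary folder holding two made-up files
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "movie_a.html").write_text("<html>A</html>", encoding="utf-8")
    (Path(tmp_dir) / "movie_b.html").write_text("<html>B</html>", encoding="utf-8")
    pages = load_html_folder(tmp_dir)
print(list(pages))
```

Each value in `pages` is a string of HTML that can then be handed to a parser.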