# Gathering data


- Gathering data is the first step in data wrangling
- It varies from project to project. Sometimes you are given data or pointed to it. Sometimes you need to search for the right data for your project
- Sometimes the data you need isn't readily available and you need to generate it yourself somehow

## Dataset: Finding the best movies?
#### How to find the best movies
- You may check movie rating websites like Rotten Tomatoes or IMDb to help you choose. These sites contain a number of different metrics which are used to evaluate whether or not you will like a movie.
- However, because the metrics don't show on the same page, figuring out the best movie can get confusing

### Use a scatter plot
- We can start with [Rotten Tomatoes: Top 100 Movies of All Time](https://www.rottentomatoes.com/browse/movies_at_home/critics:certified_fresh?page=1)
- We can use a scatter plot to look at audience scores vs critic scores for each movie

- If we put audience scores on the horizontal axis and critic scores on the vertical axis, movies in the top right quadrant are amazing movies with high audience and critic scores
- Movies in the bottom right quadrant are critically underrated with high audience scores and low critic scores
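
The quadrant reading described above can be sketched as a small helper. The function name `quadrant`, the cutoff of 50 and the wording of the two left-hand labels are assumptions made for illustration; these notes do not fix a threshold.

```python
def quadrant(audience_score, critic_score, cutoff=50):
    """Say which quadrant a movie falls in on the audience (x) vs critic (y) plot.

    The cutoff separating "high" from "low" scores is a hypothetical choice.
    """
    high_audience = audience_score >= cutoff
    high_critic = critic_score >= cutoff
    if high_audience and high_critic:
        return "top right: high audience and critic scores"
    if high_audience:
        return "bottom right: high audience score, low critic score"
    if high_critic:
        return "top left: low audience score, high critic score"
    return "bottom left: low audience and critic scores"


print(quadrant(90, 99))
print(quadrant(85, 30))
```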

### Generate a Word Cloud
- For lots of people, Roger Ebert's movie review was the only review they needed because he explained the movie in such a way that they would know whether they would like it or not
- Wouldn't it be neat if we had a word cloud for each of the movies in the top 100 list, built from the reviews at [RogerEbert.com](https://www.rogerebert.com/)?
- We can use Andreas Mueller's [Word Cloud Generator in Python](https://amueller.github.io/word_cloud/) to help.

- To create both of these visualizations, the data is in different spots and it will require some craftiness to gather it all

## Using a Pre-Gathered Dataset
- The dataset has four columns: Rank, Title, Rating, Number of Reviews
- The file is in **TSV** (tab-separated values) format

### Flat File Structure
- Flat files contain tabular data in plain text format, with one data record per line and each record or line having one or more fields
- The fields are separated with delimiters like commas, tabs or colons

#### Advantages of flat files
- They are text files and therefore human readable
- Lightweight
- Simple to understand
- Great for small datasets

#### Disadvantages
- Lack of standards
- Data redundancy
- Sharing data can be cumbersome
- Not great for large datasets

### Flat Files in Python
- Because flat files are human readable, you can easily write your own code to parse these files using Python
- Open the file, read the text line by line, separate each line's content by the delimiter, and store everything in your data structure of choice
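
The steps above can be sketched by hand as follows. The sample rows and the helper name `parse_flat_file` are made up for illustration, and an in-memory `io.StringIO` stands in for a real file handle.

```python
import io

# Hypothetical two-row sample in a tab-separated layout.
sample = io.StringIO(
    "ranking\tcritic_score\ttitle\n"
    "1\t99\tThe Wizard of Oz (1939)\n"
    "2\t100\tCitizen Kane (1941)\n"
)


def parse_flat_file(handle, delimiter="\t"):
    """Read a delimited text file into a list of dictionaries keyed by the header."""
    header = handle.readline().rstrip("\n").split(delimiter)
    records = []
    for line in handle:
        fields = line.rstrip("\n").split(delimiter)
        records.append(dict(zip(header, fields)))
    return records


rows = parse_flat_file(sample)
print(rows[0]["title"])
```

Every value stays a string here, and quoted fields are not handled; that is part of why a library function such as `read_csv` is usually the more practical choice.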


###### Reading comma separated data
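- We use the pandas `read_csv` function to bring delimited data into a DataFrame. Passing `sep="\t"` lets it read tab-separated files as well as comma-separated ones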

In [1]:
import pandas as pd

In [2]:
# importing the Rotten Tomatoes bestofrt TSV file into a data frame
df = pd.read_csv("bestofrt.tsv", sep="\t")

In [3]:
# checking to see if the file was imported correctly
df.head()

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370


## Source: Web Scraping
### Data isn't always easy to access
- We want to get Rotten Tomatoes audience scores and the number of audience reviews to add to our dataset
- Since it's not easily accessible from the website, we need to do **web scraping**, which allows us to extract data from websites using code

### How does web scraping work
- Website data is written in **HTML**, which uses tags to structure the page. Because HTML and its tags are just text, the text can be accessed using parsers
- We'll be using a Python parser called **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)**
- Start by exploring the structure of HTML files

### Saving HTML
#### Accessing the HTML
- __Manual Access__
- The quick way to get HTML data is by saving the HTML file to your computer manually. You can do it by clicking **Save** in your browser.

- __Programmatic Access__
- Programmatic access is preferred for scalability and reproducibility. Two options include:
1. Downloading HTML file programmatically

    ```python
    import requests
    url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
    response = requests.get(url)

    # saving HTML to file; response.content is bytes, so open in binary mode
    with open('et_the_extraterrestrial.html', 'wb') as file:
        file.write(response.content)
    # This code downloads one file; to download all 100 files we'll have to put it in a loop
    ```
2. Working with the response content live in your computer's memory using the BeautifulSoup HTML parser (doesn't require saving a file to your computer at all)

    ```python
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(response.content, 'lxml')
    ```

### Accessing HTML in this lesson
- Open the pre-gathered HTML folder (`rt_html`), which contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time
- We are using pre-gathered data in this lesson because scraping code can break easily when websites are redesigned, which makes scraping brittle and not recommended for projects with longevity
- There are some [ethical issues](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) involved in web scraping

### HTML Files in Python
- HTML files are text files that can be opened and inspected in a text editor, which means you can write your own code to parse the text
- For example to get our audience score metrics, we could find all the instances of the `span` tag using Python tools like `str.find` or regular expressions to search for and extract patterns in the text. But we have a better tool
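
The manual approach mentioned above might look like the following sketch. The HTML fragment is a made-up stand-in modeled on the `audience-score meter` div described in these notes, not the real page, and the regular expression is only one possible pattern.

```python
import re

# Hypothetical fragment; the real Rotten Tomatoes markup is larger and may differ.
html = """
<div class="audience-score meter">
    <span class="superPageFontColor" style="vertical-align:top">72%</span>
</div>
"""

# str.find gives the position of the first <span, but the slicing that follows is manual.
span_start = html.find("<span")

# A regular expression can capture the digits before the percent sign directly.
match = re.search(r"<span[^>]*>(\d+)%</span>", html)
audience_score = int(match.group(1)) if match else None
print(span_start, audience_score)
```

Patterns like this break as soon as the markup changes slightly, which is why a real HTML parser is preferable.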

### Beautiful Soup
- **Beautiful Soup** is an HTML parser written in the Python programming language
- The name is derived from the "tag soup" which refers to the unstructured and difficult-to-parse HTML found on many websites

## Using Beautiful Soup
- To get started we need to import Beautiful Soup

In [2]:
from bs4 import BeautifulSoup

- Next we make the soup by opening the HTML file to get a file handle, then passing that file handle into the `BeautifulSoup` constructor along with the name of the parser
- We are using `lxml`, which is a popular parser
- If it is not already installed, it can be installed with `conda install lxml`

In [15]:
with open("rt_html/12_years_a_slave.html") as file:
    soup = BeautifulSoup(file, "lxml")

In [28]:
# soup

- The result looks exactly like an HTML document, and we can use methods in the Beautiful Soup library to easily find and extract data from this HTML

### Use the Find Method
- `find()` is one of the most popular Beautiful Soup methods. It's similar to the find feature in a text editor

In [4]:
# getting the title of our movie
soup.find('title')

<title>12 Years a Slave (2013) - Rotten Tomatoes</title>

- We get the title element of the web page and not the title of the movie
- To get the movie title we need to do some string slicing
- We can use `.contents` to return a list of the tag's children. Because there is only one item in the title tag, the list is one item long so we can access it using the index `[0]`

In [18]:
# .contents returns a list of all the tag's children
soup.find('title').contents

['12 Years a Slave\xa0(2013) - Rotten Tomatoes']

In [20]:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'12 Years a Slave\xa0(2013)'

In [13]:
# .string is a shortcut for the text when the tag has a single string child
soup.find('title').string[:-len(' - Rotten Tomatoes')]

'12 Years a Slave\xa0(2013)'

In [21]:
len(' - Rotten Tomatoes')

18

- The `\xa0` in the returned title is the escape sequence for a non-breaking space (Unicode U+00A0), which we will need to deal with later in the cleaning step
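
One possible way to handle the non-breaking space, shown only as a sketch (the lesson defers the actual fix to the cleaning step), is to replace it with an ordinary space:

```python
title = '12 Years a Slave\xa0(2013)'

# Replace the non-breaking space (U+00A0) with an ordinary space.
clean_title = title.replace('\xa0', ' ')
print(clean_title)
```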

- **Resources**
- Beautiful Soup `.find` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
- Beautiful Soup `.contents` [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children)

- We are going to use Beautiful Soup to extract our desired audience score metric and number of audience ratings, along with the movie title, for each HTML file, then save them in a pandas DataFrame
- Write code that:
    * Creates an empty list `df_list` to which dictionaries will be appended. The list of dictionaries will eventually be converted into a pandas DataFrame
    * Loops through each movie's Rotten Tomatoes HTML file in the `rt_html` folder
    * Opens each HTML file and passes it into a file handle called `file`
    * Creates a DataFrame called `df` by converting `df_list` using the `pd.DataFrame` constructor
- Resource: [Beautiful Soup documentation on searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)

In [22]:
from bs4 import BeautifulSoup
import os
import pandas as pd

- Make the soup by passing in the file handle
- The first thing to grab from the HTML is the title. Use the `find` method to get the only title tag in the document, access its contents with `.contents`, take the first item, and slice off the trailing ` - Rotten Tomatoes`

- The next thing to grab is the audience score. Looking at the HTML, the score (72% in this example) sits within a div with the class `audience-score meter`, inside the only span tag within that outermost div. Because it's the first span tag, we can use the `.find` method

- Grabbing the number of audience ratings is a bit more complicated. If we scroll down we see the user ratings. The outermost div has the class `audience-info hidden-xs superPageFontColor`
- When you print it out, you notice that within that div tag are two other div tags, and the number of audience ratings is in the second one. We can use `find_all` and take the second item in the returned list (index 1). If we look at its contents, the number of audience ratings is the third item (index 2)
- There is a bunch of whitespace that we'll strip out using Python's `strip` method. The value is currently a string, and we later have to convert it to an integer. To do that we first remove the commas using Python's `replace` method (replacing commas with empty strings)

- All that's left to do is convert our list of dictionaries, `df_list`, to a pandas DataFrame
- Make sure pandas is imported and aliased as `pd`, then specify the column order
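
The string clean-up for the number of audience ratings can be tried in isolation. The raw value below is a hypothetical stand-in shaped like the text pulled from the div's contents; the exact whitespace on the real page may differ.

```python
# Hypothetical raw text with surrounding whitespace and a thousands separator.
raw = '\n        23,391'

# Strip the whitespace, drop the commas, then convert to an integer.
cleaned = raw.strip().replace(',', '')
num_audience_ratings = int(cleaned)
print(num_audience_ratings)
```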

In [41]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list= []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        # print(num_audience_ratings)
        # break
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents
        # print(num_audience_ratings)
        # break
        num_audience_ratings = num_audience_ratings[2].strip().replace(',', '')
        # print(num_audience_ratings)
        # break

        # appending to list of dictionaries
        df_list.append(
            {
                'title': title,
                'audience_score': int(audience_score),
                'number_of_audience_ratings': int(num_audience_ratings)
            }
        )

df = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])


In [42]:
# taking a look at the data frame
df

Unnamed: 0,title,audience_score,number_of_audience_ratings
0,The Big Sick (2017),90,23391
1,All Quiet on the Western Front (1930),89,17768
2,Gone With the Wind (1939),93,292794
3,Zootopia (2016),92,98633
4,Bicycle Thieves (Ladri di biciclette) (1949),94,33723
...,...,...,...
95,L.A. Confidential (1997),94,149772
96,Pinocchio (1940),72,278682
97,Argo (2012),90,207373
98,The Grapes of Wrath (1940),88,23954


- Run the cell below to see if your solution is correct. If an `AssertionError` is thrown, your solution is incorrect. If no error is thrown, your solution is correct

In [None]:
# solution test
df_solution = pd.read_pickle('df_solution.pkl')
df.sort_values('title', inplace=True)
df.reset_index(inplace=True, drop=True)
df_solution.sort_values('title', inplace=True)
df_solution.reset_index(inplace=True, drop=True)
pd.testing.assert_frame_equal(df, df_solution)

## More Information
- [Learning about the non-breaking space issue encountered earlier](https://stackoverflow.com/questions/19508442/beautiful-soup-and-unicode-problems)
- [Fixing the non-breaking space issue](https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python)

- We've gathered enough data to produce a scatter plot with audience score on the horizontal axis and critic score on the vertical axis
- The next step is joining our two DataFrames, then visualizing

## Downloading Files from the internet
#### Starting the Roger Ebert review word cloud
- We'll need the text of his reviews, which live on his website, for each movie on the Rotten Tomatoes Top 100 Movies of All Time list
- This text has been pre-gathered in the form of `.txt` files that can be downloaded programmatically
- Downloading files from the internet programmatically is best for scalability and reproducibility
- Only Python's `requests` library is needed in practice, but a bit of HTTP (Hypertext Transfer Protocol) helps in understanding what's going on under the hood

### HTTP
- HTTP is the language that web browsers (Chrome, Safari) and web servers (Apache; basically the computers where the contents of a website are stored) use to communicate with each other
- Every time you open a webpage, download a file or watch a video, it's HTTP that makes it possible
- HTTP is a request/response protocol
    * The computer (client) sends a request to a server for some file, and the web server sends back a response, which contains the file if the request is valid
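
To make the request/response exchange more concrete, here is a rough sketch of what the messages look like as text. The host and path are placeholders, and `parse_status_line` is a hypothetical helper written for this example, not part of any library.

```python
from http import HTTPStatus

# What a minimal GET request looks like as text (host and path are placeholders).
request_text = (
    "GET /reviews/example.txt HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "\r\n"
)


def parse_status_line(line):
    """Split a response status line into (version, status code, reason phrase)."""
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason


# The first line of a successful response.
version, code, reason = parse_status_line("HTTP/1.1 200 OK\r\n")
print(code, reason, code == HTTPStatus.OK)
```

In practice a library builds and parses these messages for you, so the status code is usually all you need to inspect.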

## Requests: HTTP for Humans
- Python's `requests` library makes HTTP requests easy. Its `get` function sends the request for us and returns a response whose content (here, a text file) we can then save to a file
- Programmatically downloading a file from the internet:

In [4]:
import requests
import os

In [6]:
# creating the folder ebert_reviews if it doesn't exist already
folder_name = "/home/mark/data_science/Udacity/data_analysis/3-data_wrangling/1-gathering/ebert_reviews"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# request code
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt"
response = requests.get(url)
response

<Response [200]>

- If the request is successful, the server returns a 200 response; 200 is the HTTP status code for a successful request
- This tells you that all the text in our text file is now in the computer's working memory, in the body of the response

In [8]:
response.content

b'E.T. The Extra-Terrestrial (1982)\nhttp://www.rogerebert.com/reviews/great-movie-et-the-extra-terrestrial-1982\nDear Raven and Emil:\n\nSunday we sat on the big green couch and watched "E.T. The Extra-Terrestrial" together with your mommy and daddy. It was the first time either of you had seen it, although you knew a little of what to expect because we took the "E.T." ride together at the Universal tour. I had seen the movie lots of times since it came out in 1982, so I kept one eye on the screen and the other on the two of you. I wanted to see how a boy on his fourth birthday, and a girl who had just turned 7 a week ago, would respond to the movie.\n\nWell, it "worked" for both of you, as we say in Grandpa Roger\'s business.\n\nRaven, you never took your eyes off the screen--not even when it looked like E.T. was dying and you had to scoot over next to me because you were afraid.\n\nEmil, you had to go sit on your dad\'s knee a couple of times, but you never stopped watching, either.

### Access the content and write to a file
- Use the `response.content` attribute and some basic file I/O to save this file to our computer
- We'll open a file called `11-e.t.-the-extra-terrestrial.txt`, i.e. everything after the last slash in the URL
- To get everything after the last slash we'll use Python's `split` method and select the last item in the returned list
- We need to open this file so that we can write the contents of the response to it
- We'll open it in `wb` mode, which stands for write binary, because `response.content` is in bytes, not text
- When we open the file in a text editor or pandas, the bytes will be rendered as human-readable text
- Then we write to the file handle we've opened: `file.write(response.content)`

In [10]:
# downloading one file programmatically
with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)

In [11]:
# checking the contents of our folder to make sure it worked
os.listdir(folder_name)

['11-e.t.-the-extra-terrestrial.txt']
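
The file is written as bytes but read back as text. The sketch below illustrates that round trip offline, using a placeholder URL, stand-in content and a temporary folder, and assuming the text is UTF-8 encoded.

```python
import os
import tempfile

url = "https://example.com/reviews/11-example-review.txt"  # placeholder URL
content = "E.T. The Extra-Terrestrial (1982)\nDear Raven and Emil:\n".encode("utf-8")

with tempfile.TemporaryDirectory() as folder:
    # Everything after the last slash becomes the file name.
    path = os.path.join(folder, url.split("/")[-1])

    # Bytes must be written in binary mode.
    with open(path, mode="wb") as file:
        file.write(content)

    # Reading in text mode decodes the bytes back into a string.
    with open(path, encoding="utf-8") as file:
        first_line = file.readline()

print(first_line)
```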

- If we have lots of files to download, as we do for all of these Ebert reviews for the Rotten Tomatoes Top 100 list, we can use a for loop over all of the file URLs to minimize code repetition

In [1]:
ebert_review_urls = [
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
    'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt'
    ]

- Writing a loop to download all of the Roger Ebert review files programmatically

In [8]:
folder_name = "/home/mark/data_science/Udacity/data_analysis/3-data_wrangling/1-gathering/ebert_reviews"
if not os.path.exists(folder_name):
    os.mkdir(folder_name)

for url in ebert_review_urls:
    response = requests.get(url)
    with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
        file.write(response.content)

## Text File Structure
- A text file refers to a file that uses a specific character set and contains no formatting like italics or bolding and also has no media like images or video
- Lines of text are separated by newline characters, written as `\n` (backslash n) in Python. These characters are invisible in most software applications like text editors
- Flat files like `tsv` are technically text files but they have a specific structure
- These Roger Ebert review text files, though, are just blobs of text with no defined structure like the tabular structure of flat files. All of the techniques we are going to apply, though, can be done on any text file regardless of structure
- Regarding character sets: have you ever opened a document and found its characters garbled, like a bunch of question marks in a row or a bunch of weird characters? That's because your text editor, browser, word processor or whatever else you are using to read the document is assuming the wrong encoding, i.e. the wrong scheme for converting bits to letters and numbers.
- You need to select the right encoding to display the document properly
- Character sets and encodings are two things that every programmer needs to be aware of when working with any text data, including flat files like CSV files, HTML and JSON

- **Character sets** are the collections of characters that are available for use in a system
- **Encoding** is the scheme for converting a character set's bits to letters and numbers
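
A short sketch of how a wrong encoding garbles text. The word `café` and the two encodings chosen (UTF-8 and Latin-1) are just illustrative picks.

```python
text = "café"

# The same characters become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding with the wrong encoding produces garbled characters instead of an error.
garbled = utf8_bytes.decode("latin-1")

# Decoding with the right encoding recovers the original text.
restored = utf8_bytes.decode("utf-8")
print(utf8_bytes, latin1_bytes, garbled, restored)
```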

## Text Files in Python
- Gathering data from text files in Python means opening and reading from files
- If you are using pandas it also means storing the text data you just read in a pandas DataFrame
- We have 88 Roger Ebert reviews to open and read. We'll need a loop to iterate through all the files in this folder to open and read each
- There are two ways to do it:
    * Using the `os` module
    * Using the `glob` module
- The `glob` module allows for Unix-style pathname pattern expansion (using glob patterns to specify sets of filenames)
- These glob patterns use wildcard characters
- `glob.glob(pathname, *, recursive=False)` returns a list of pathnames that match `pathname`
- We want all file names that end in `.txt`, which in our folder is all of them, and because `glob.glob` returns a list we can loop through that directly
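
For comparison, the two approaches can be sketched side by side. The file names are hypothetical and a temporary folder stands in for `ebert_reviews`.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty stand-in files (names are hypothetical).
    for name in ["1-first-review.txt", "2-second-review.txt", "notes.md"]:
        open(os.path.join(folder, name), "w").close()

    # os approach: list everything, then filter by extension.
    with_os = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".txt")
    )

    # glob approach: the wildcard pattern does the filtering.
    with_glob = sorted(glob.glob(os.path.join(folder, "*.txt")))

print(with_os == with_glob, len(with_glob))
```

Neither `os.listdir` nor `glob.glob` guarantees any particular order, hence the `sorted` calls.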

In [14]:
import glob
import pandas as pd

In [10]:
for ebert_review in glob.glob("ebert_reviews/*.txt"):
    print(ebert_review)

ebert_reviews/66-vertigo-film.txt
ebert_reviews/31-sunset-boulevard-film.txt
ebert_reviews/32-king-kong-1933-film.txt
ebert_reviews/77-l.a.-confidential-film.txt
ebert_reviews/55-logan-film.txt
ebert_reviews/37-selma-film.txt
ebert_reviews/9-the-godfather.txt
ebert_reviews/7-all-about-eve.txt
ebert_reviews/29-12-years-a-slave-film.txt
ebert_reviews/16-casablanca-film.txt
ebert_reviews/54-the-400-blows.txt
ebert_reviews/94-the-grapes-of-wrath-film.txt
ebert_reviews/53-12-angry-men-1957-film.txt
ebert_reviews/35-rashomon.txt
ebert_reviews/91-the-wages-of-fear.txt
ebert_reviews/73-finding-nemo.txt
ebert_reviews/93-harry-potter-and-the-deathly-hallows-part-2.txt
ebert_reviews/40-argo-2012-film.txt
ebert_reviews/38-taxi-driver.txt
ebert_reviews/43-bride-of-frankenstein.txt
ebert_reviews/45-m-1931-film.txt
ebert_reviews/75-the-wrestler-2008-film.txt
ebert_reviews/18-psycho-1960-film.txt
ebert_reviews/15-boyhood-film.txt
ebert_reviews/11-e.t.-the-extra-terrestrial.txt
ebert_reviews/98-toy-sto

- You can pass the full path of each file straight into Python's `open` function, using the same pattern as before: `with open(path) as file`, where `file` is whatever handle name you choose. In Python 3, when opening text to read, you should call `open` with an explicit encoding, passed via the `encoding` parameter
- Doing so means you get correctly decoded Unicode, or an error right off the bat, which makes debugging much easier
- The actual encoding depends on the source of the text (look at the HTML source file)
- We don't want all the text data in one big chunk, which is what `file.read` gives us. We can check this for one of the files with a print statement followed by a `break` out of the loop (this prints the first file)
- Instead we want the first line (movie title), the second line (URL) and everything from the third line onwards (review text) as separate pieces of data, so we can't just use `file.read`. Since lines in a text file are separated by newline characters, and the file object returned by `with open(...) as file` is an iterator, we can read the file line by line
- To read one line, use `file.readline`. The first call returns the title of the movie
- After printing there's a bit of whitespace below the title, which is actually the newline character. We can get rid of it by slicing it off the end of the string
- Next we grab the URL and the full review text. Before that, recall that we need all this data in a pandas DataFrame, so we need to build one
- The most computationally efficient way to do that is to create an empty list first, then populate it one item at a time as we iterate through the loop
- We'll fill the list with dictionaries, and the list will later be converted to a pandas DataFrame
- Since the review URL is on line two, we just use `readline` again to read the next line in the file. To read the rest of the lines in the text file we use `file.read`
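
The reading steps above, and the benefit of an explicit encoding, can be tried on a stand-in file. This is a sketch under assumptions: the file contents are invented, and `rstrip("\n")` is swapped in for the slice used in these notes.

```python
import os
import tempfile

# Write a small stand-in review file (the contents are invented for this sketch).
path = os.path.join(tempfile.mkdtemp(), "1-example-film.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Example Film (1999)\nhttp://example.com/review\nLine one.\nLine two.\n")

with open(path, encoding="utf-8") as file:
    title = file.readline().rstrip("\n")       # first line, newline removed
    review_url = file.readline().rstrip("\n")  # second line, newline removed
    review_text = file.read()                  # everything that is left

print(title, review_url, repr(review_text))

# The wrong encoding fails loudly instead of silently producing bad text.
with open(path, "w", encoding="utf-8") as f:
    f.write("“Quoted” title\n")
try:
    with open(path, encoding="ascii") as file:
        file.readline()
    failed = False
except UnicodeDecodeError:
    failed = True

print(failed)
```

`rstrip("\n")` only removes a trailing newline if one is present, whereas `[:-1]` always drops the last character, so the two differ when a line has no trailing newline.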

In [15]:
df_list = []
for ebert_review in glob.glob("ebert_reviews/*.txt"):
    with open(ebert_review, encoding='utf-8') as file:
        # file.read()
        # print(file.read())  # first file in the folder
        # print(file.readline()[:-1]) # first line in the file, title of the movie
        # break
        title = file.readline()[:-1]
        review_url = file.readline()[:-1]
        review_text = file.read()
        df_list.append({
            'title': title,
            'review_url': review_url,
            'review_text': review_text
        })

df = pd.DataFrame(df_list, columns=['title', 'review_url', 'review_text'])

In [16]:
df.head()

|   | title | review_url | review_text |
|---|---|---|---|
| 0 | Vertigo (1958) | http://www.rogerebert.com/reviews/great-movie-... | “Did he train you? Did he rehearse you? Did he... |
| 1 | Sunset Boulevard (1950) | http://www.rogerebert.com/reviews/great-movie-... | Billy Wilder's "Sunset Boulevard” is the portr... |
| 2 | King Kong (1933) | http://www.rogerebert.com/reviews/great-movie-... | On good days I consider "Citizen Kane" the sem... |
| 3 | L.A. Confidential (1997) | http://www.rogerebert.com/reviews/great-movie-... | "L.A. Confidential" finished at No. 1 in a lis... |
| 4 | Logan (2017) | http://www.rogerebert.com/reviews/logan-2017 | Is “Logan” more powerful because of what the s... |


## Application Programming Interfaces
- Next we need each movie's poster to add to its word cloud
- We could scrape the image URL from the HTML, but a better way is to access the images through an API
- APIs let us access data from the internet in a reasonably easy manner
- There are many open APIs. We'll use the API of [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki), the open source software behind Wikipedia

### APIs and Access Libraries
- The goal is to get the movie poster images somehow
- Rotten Tomatoes does have an API and it does provide audience scores, which means we could have hit the API instead of scraping the Rotten Tomatoes web page. But the API doesn't provide posters or images, and it requires you to apply for access before using it
- When given a choice, always use the API over scraping. Scraping is brittle: it breaks whenever a site redesign changes the underlying HTML
- APIs and their access libraries let programmers access data in a simple, structured manner
- [rtsimple](https://pypi.org/project/rtsimple/) is a Python access library for Rotten Tomatoes. If we had permission to use the Rotten Tomatoes API we could import rtsimple, set our API key, create an object for each movie and access the ratings data directly from the movie object

    ```python
    import rtsimple as rt
    rt.api_key = '<your_api_key>'
    movie = rt.Movies('<movie_id>')
    movie.ratings['audience_score']
    ```

## MediaWiki API
- The API that provides access to all of the Wikipedia data
- They have a great [tutorial](https://www.mediawiki.org/wiki/API:Tutorial) on their website on how their API calls are structured
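
As a rough illustration of how such a call is structured, the sketch below builds (but does not send) a request URL for the English Wikipedia endpoint. The parameter choices are an assumption made for this example: they ask the `pageimages` property for a page's original image, which is one plausible way to reach a poster, not necessarily the request that any particular access library sends.

```python
from urllib.parse import urlencode

# Endpoint for English Wikipedia's MediaWiki Action API.
endpoint = "https://en.wikipedia.org/w/api.php"

# Parameters for a query that asks for a page's main image.
params = {
    "action": "query",       # which API module to call
    "titles": "E.T. the Extra-Terrestrial",
    "prop": "pageimages",    # which property of the page to return
    "piprop": "original",    # ask for the original image URL
    "format": "json",        # response format
}

# An API call is the endpoint plus a URL-encoded query string.
url = endpoint + "?" + urlencode(params)
print(url)
```

An access library wraps this step, and the parsing of the response, behind a friendlier interface.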

### wptools Library
- There are many different access libraries for MediaWiki, covering a variety of programming languages. Here's a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python
- This is pretty standard for most APIs. Some libraries are better than others, which again is standard
- For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools)
- The analogous relationship for Twitter is:
    * MediaWiki API -> wptools
    * Twitter API -> tweepy

- wptools has an even simpler tutorial on its GitHub page, using the [Mahatma Gandhi Wikipedia page](https://en.wikipedia.org/wiki/Mahatma_Gandhi) as a working example

- To get a `page` object the [usage](https://github.com/siznax/wptools/wiki/Usage#page-usage) is as follows:

In [1]:
import wptools

In [11]:
# The argument is the last part of the Wikipedia URL for the page (the wptools tutorial uses Mahatma_Gandhi)
page = wptools.page('E.T._the_Extra-Terrestrial')

In [12]:
# Calling get_query fetches the query data for the page, including extracts
page.get_query()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (query) E.T. the Extra-Terrestrial (&plcontinue=...
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  description: 1982 film directed by Steven Spielberg
  extext: <str(2186)> _**E.T. the Extra-Terrestrial**_ (or simply ...
  extract: <str(2311)> <p class="mw-empty-elt"></p><p><i><b>E.T. t...
  label: E.T. the Extra-Terrestrial
  length: 100,768
  links: <list(630)> 12 Angry Men (1957 film), 12 Monkeys, 12 Year...
  modified: <dict(1)> page
  pageid: 73441
  random: Brachyponera luteipes
  redirects: <list(37)> {'pageid': 177061, 'ns': 0, 'title': 'E.T....
  requests: <list(2)> query, query
  title: E.T. the Extra-Terrestrial
  url: https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial
  url_raw: <str(67)> https://en.wikipedia.org/wiki/E.T._the_Extra-...
  watchers: 358
  wikibase: Q11621
  wikidata_url: https://www.wikidata.

<wptools.page.WPToolsPage at 0x7f58c16a41f0>

In [13]:
# page.data is a dictionary of the fetched attributes; here we read the page label
page.data['label']

'E.T. the Extra-Terrestrial'

## Terms
- Access Library - A set of code that can be used to access an API

## JSON File Structure
- Most data from APIs comes in JSON or XML format
### JSON vs XML
- JSON stands for JavaScript Object Notation
- XML stands for eXtensible Markup Language

- JSON is especially good at representing and accessing complicated data hierarchies (sometimes fields have multiple entries, and sometimes fields have subfields)

### JSON structure
- JSON is built on two key structures;

1. __JSON Objects__
- JSON objects are a collection of key value pairs
- In Python, JSON objects are interpreted as dictionaries and you can access them like you would a standard Python dictionary
- JSON object keys must be strings

2. __JSON Arrays__
- A JSON array is an ordered list of values surrounded by square brackets
- In Python JSON arrays are interpreted and accessed like lists

- The values for both JSON objects and arrays can be any valid JSON data type: string, number, object, array, Boolean or null
- When objects and arrays are combined it is called nesting
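
A short sketch of nesting, using Python's built-in `json` module. The document below is made up for illustration; only the key names echo the infobox-style fields used in these notes.

```python
import json

# A made-up JSON document: an object containing a string, an array,
# an array of nested objects, and a null.
raw = """
{
  "Title": "Example Film",
  "Produced by": ["Producer A", "Producer B"],
  "Release": [
    {"Date": "1999-01-01", "Location": "City A"},
    {"Date": "1999-02-01", "Location": "City B"}
  ],
  "Sequel": null
}
"""

data = json.loads(raw)  # JSON object -> dict, JSON array -> list, null -> None

print(type(data).__name__)             # dict
print(data["Produced by"][0])          # Producer A
print(data["Release"][1]["Location"])  # City B
print(data["Sequel"])                  # None
```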

## JSON Files in python
### Accessing JSON files in python
- Parsed JSON can be accessed in Python just like dictionaries and lists, because JSON objects are interpreted as dictionaries and JSON arrays as lists

    ```python
    infobox_json['Box office']
    infobox_json['Produced by'][0]
    infobox_json['Release'][1]['Location']
    ```

- With what we now know about APIs, JSON and downloading files from the internet, we have everything we need to download all of the movie poster images

- Let's inspect the wptools page object for the [E.T. the Extra-Terrestrial Wikipedia page](https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial). In the Jupyter Notebook below, you will access the image and infobox attributes and the data within them.

In [2]:
import wptools

In [4]:
page = wptools.page('E.T._the_Extra-Terrestrial').get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (query) E.T. the Extra-Terrestrial (&plcontinue=...
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) P214|P3129|P2603|P4786|P5693|Q652644|P1...
www.wikidata.org (labels) Q1720784|P244|P2631|Q28732982|Q981030|Q...
www.wikidata.org (labels) P2758|P646|P6398|P6839|Q7341915|Q103360...
www.wikidata.org (labels) Q1748409|Q56887459|P5021|Q56887384|P725...
www.wikidata.org (labels) Q237207|P3854|Q739633|P950|Q3897561|Q18...
www.wikidata.org (labels) Q488651
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:E t the extra terrestrial ver3....
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(130)> P1562, P57, P272, P345, P31, P161, P373, P48...
  description: 1982 American science fiction film
  exhtml: <str(485)> <p><i><b>E.T. th

In [11]:
# Access the first image in the image attribute, which is a JSON array.
page.data['image'][0]

{'kind': 'parse-image',
 'file': 'File:E t the extra terrestrial ver3.jpg',
 'orig': 'E t the extra terrestrial ver3.jpg',
 'timestamp': '2016-06-04T10:30:46Z',
 'size': 83073,
 'width': 253,
 'height': 394,
 'url': 'https://upload.wikimedia.org/wikipedia/en/6/66/E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
 'title': 'File:E t the extra terrestrial ver3.jpg',
 'metadata': {'DateTime': {'value': '2016-06-04 10:30:46',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'ObjectName': {'value': 'E t the extra terrestrial ver3',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'CommonsMetadataExtension': {'value': 1.2,
   'source': 'extension',
   'hidden': ''},
  'Categories': {'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of film posters|Files with no machine-readable author|Noindexed pages|Wik

In [18]:
# Access the director key of the infobox attribute, which is a JSON object.
page.data['infobox']['director']

'[[Steven Spielberg]]'

## More JSON in python
- In the example above the JSON data came from an API. That isn't always the case, though

- Sometimes you are given a text file with human-readable JSON in it. For this situation the [json](https://docs.python-guide.org/scenarios/json/) library is indispensable. It can parse JSON from strings or files into a Python dictionary or list
- It can also convert Python dictionaries or lists into JSON strings
- [Reading and Writing JSON to a File in Python](https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/)
- pandas also has JSON functions (the `read_json` function and the `to_json` DataFrame method), but the hierarchical advantage of JSON is wasted in a tabular pandas DataFrame, so their uses are limited
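
The conversions described above can be sketched with the `json` module's four main functions. The record is a hypothetical example and the file is written to a temporary folder.

```python
import json
import os
import tempfile

# A hypothetical record; the keys and values are invented for illustration.
movie = {"title": "Example Film", "year": 1999, "genres": ["drama", "comedy"]}

# Python dict -> JSON string, and back again.
as_text = json.dumps(movie, indent=2)
from_text = json.loads(as_text)

# Python dict -> JSON file, and back again.
path = os.path.join(tempfile.mkdtemp(), "movie.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(movie, f)
with open(path, encoding="utf-8") as f:
    from_file = json.load(f)

print(from_text == movie, from_file == movie)
```

The functions ending in `s` (`dumps`, `loads`) work with strings, while `dump` and `load` work with open file objects.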

## Mashup: APIs, Downloading Files Programmatically and JSON
- Download all of the movie poster images for the Roger Ebert review word clouds

- Two key things to be aware of:
    1. Wikipedia Page Titles
    - To access Wikipedia page data via the MediaWiki API with wptools, you need each movie's Wikipedia page title (what comes after the last slash in the page URL)

    2. Downloading Image Files
    - Downloading images may seem tricky from a reading and writing perspective, compared to text files, which you can read line by line
    - But in reality image files aren't special; they are just binary files
    - To interact with them you don't need special software. You can use regular file opening, reading and writing techniques
        
        ```python
        import requests
        r = requests.get(url)
        with open(folder_name + '/' + filename, 'wb') as f:
            f.write(r.content)
        ```
    - This technique can be error-prone. It will work most of the time, but sometimes the file you write will be damaged
    - These errors are why the requests library maintainers recommend using the [PIL](https://pillow.readthedocs.io/en/stable/) library and `BytesIO` from the `io` library for non-text requests like images
    - They recommend that you access the response body as bytes for non-text requests. For example, to create an image from binary data returned by a request:
        
        ```python
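        import requests
        from PIL import Image
        from io import BytesIO
        # url is assumed to hold the image address, as in the snippet above
        r = requests.get(url)
        # Wrap the raw bytes in a file-like object so PIL can open them
        img = Image.open(BytesIO(r.content))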
        ```

    - You may still encounter a similar file error, but the code above will at least warn us with an error message, at which point we can manually download the problematic images
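
Another way to catch a damaged download before it is written to disk is to check that the bytes begin with a known image signature. This is a minimal sketch, not something requests or PIL provides: `looks_like_image` is a hypothetical helper, and it assumes that a failed download returned something other than image bytes (for example an HTML error page).

```python
# File signatures ("magic numbers") for a few common image formats.
IMAGE_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
}


def looks_like_image(content):
    """Return the detected format name, or None if no signature matches."""
    for name, signature in IMAGE_SIGNATURES.items():
        if content.startswith(signature):
            return name
    return None


# Stand-ins for r.content from two requests: one good, one an error page.
good_body = b"\xff\xd8\xff\xe0" + b"\x00" * 16
bad_body = b"<!DOCTYPE html><html><body>Error</body></html>"

print(looks_like_image(good_body))  # jpeg
print(looks_like_image(bad_body))   # None
```

In a download loop, a `None` result could be logged and skipped rather than saved as a broken file.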

## Task
- Gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files
- Let's also keep each image's URL to add to the master DataFrame later.
- We'll use a loop, and here's how the parts inside that loop will work, in order:
    * We query the MediaWiki API using `wptools` to get a movie poster URL via each page object's `image` attribute
    * Using that URL, we programmatically download the image into a folder called `bestofrt_posters`

In [1]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [2]:
title_list = [
    'The_Wizard_of_Oz_(1939_film)',
    'Citizen_Kane',
    'The_Third_Man',
    'Get_Out_(film)',
    'Mad_Max:_Fury_Road',
    'The_Cabinet_of_Dr._Caligari',
    'All_About_Eve',
    'Inside_Out_(2015_film)',
    'The_Godfather',
    'Metropolis_(1927_film)',
    'E.T._the_Extra-Terrestrial',
    'Modern_Times_(film)',
    'It_Happened_One_Night',
    "Singin'_in_the_Rain",
    'Boyhood_(film)',
    'Casablanca_(film)',
    'Moonlight_(2016_film)',
    'Psycho_(1960_film)',
    'Laura_(1944_film)',
    'Nosferatu',
    'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
    "A_Hard_Day%27s_Night_(film)",
    'La_Grande_Illusion',
    'North_by_Northwest',
    'The_Battle_of_Algiers',
    'Dunkirk_(2017_film)',
    'The_Maltese_Falcon_(1941_film)',
    'Repulsion_(film)',
    '12_Years_a_Slave_(film)',
    'Gravity_(2013_film)',
    'Sunset_Boulevard_(film)',
    'King_Kong_(1933_film)',
    'Spotlight_(film)',
    'The_Adventures_of_Robin_Hood',
    'Rashomon',
    'Rear_Window',
    'Selma_(film)',
    'Taxi_Driver',
    'Toy_Story_3',
    'Argo_(2012_film)',
    'Toy_Story_2',
    'The_Big_Sick',
    'Bride_of_Frankenstein',
    'Zootopia',
    'M_(1931_film)',
    'Wonder_Woman_(2017_film)',
    'The_Philadelphia_Story_(film)',
    'Alien_(film)',
    'Bicycle_Thieves',
    'Seven_Samurai',
    'The_Treasure_of_the_Sierra_Madre_(film)',
    'Up_(2009_film)',
    '12_Angry_Men_(1957_film)',
    'The_400_Blows',
    'Logan_(film)',
    'All_Quiet_on_the_Western_Front_(1930_film)',
    'Army_of_Shadows',
    'Arrival_(film)',
    'Baby_Driver',
    'A_Streetcar_Named_Desire_(1951_film)',
    'The_Night_of_the_Hunter_(film)',
    'Star_Wars:_The_Force_Awakens',
    'Manchester_by_the_Sea_(film)',
    'Dr._Strangelove',
    'Frankenstein_(1931_film)',
    'Vertigo_(film)',
    'The_Dark_Knight_(film)',
    'Touch_of_Evil',
    'The_Babadook',
    'The_Conformist_(film)',
    'Rebecca_(1940_film)',
    "Rosemary%27s_Baby_(film)",
    'Finding_Nemo',
    'Brooklyn_(film)',
    'The_Wrestler_(2008_film)',
    'The_39_Steps_(1935_film)',
    'L.A._Confidential_(film)',
    'Gone_with_the_Wind_(film)',
    'The_Good,_the_Bad_and_the_Ugly',
    'Skyfall',
    'Rome,_Open_City',
    'Tokyo_Story',
    'Hell_or_High_Water_(film)',
    'Pinocchio_(1940_film)',
    'The_Jungle_Book_(2016_film)',
    'La_La_Land_(film)',
    'Star_Trek_(film)',
    'High_Noon',
    'Apocalypse_Now',
    'On_the_Waterfront',
    'The_Wages_of_Fear',
    'The_Last_Picture_Show',
    'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
    'The_Grapes_of_Wrath_(film)',
    'Roman_Holiday',
    'Man_on_Wire',
    'Jaws_(film)',
    'Toy_Story',
    'The_Godfather_Part_II',
    'Battleship_Potemkin'
]

In [3]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [7]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
# enumerate supplies the ranking directly (rankings start at 1)
for ranking, title in enumerate(title_list, start=1):
    # Reset so a failure before the lookup doesn't record the previous movie's images
    images = None
    try:
        # This cell is slow so print ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        # Fetch the page data and pull out its list of images
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # Append to list of dictionaries
        df_list.append({'ranking': ranking,
                        'title': title,
                        'poster_url': first_image_url})

    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
1_The_Wizard_of_Oz_(1939_film): cannot identify image file <_io.BytesIO object at 0x7f98a07f6ef0>
2
2_Citizen_Kane: cannot identify image file <_io.BytesIO object at 0x7f98ccf454f0>
3
3_The_Third_Man: cannot identify image file <_io.BytesIO object at 0x7f98a0b1bf40>
4
5
6
7
7_All_About_Eve: cannot identify image file <_io.BytesIO object at 0x7f98a0a0e590>
8
9
10
10_Metropolis_(1927_film): cannot identify image file <_io.BytesIO object at 0x7f98a0867450>
11
12
13
14
14_Singin'_in_the_Rain: cannot identify image file <_io.BytesIO object at 0x7f98a07f6180>
15
15_Boyhood_(film): 'image'
16
17
18
18_Psycho_(1960_film): cannot identify image file <_io.BytesIO object at 0x7f98a0886680>
19
19_Laura_(1944_film): cannot identify image file <_io.BytesIO object at 0x7f98a14066d0>
20
20_Nosferatu: cannot identify image file <_io.BytesIO object at 0x7f98a143ad60>
21
22


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
24
24_North_by_Northwest: cannot identify image file <_io.BytesIO object at 0x7f98a0890360>
25
26
27
27_The_Maltese_Falcon_(1941_film): cannot identify image file <_io.BytesIO object at 0x7f989b9eba40>
28
29
30
31
32
32_King_Kong_(1933_film): cannot identify image file <_io.BytesIO object at 0x7f98a08a19a0>
33
33_Spotlight_(film): https://en.wikipedia.org/w/api.php?action=query&exintro&formatversion=2&inprop=url|watchers&list=random&pithumbsize=240&pllimit=500&ppprop=disambiguation|wikibase_item&prop=extracts|info|links|pageassessments|pageimages|pageprops|pageterms|redirects&redirects&rdlimit=500&rnlimit=1&rnnamespace=0&titles=Spotlight%20%28film%29&plcontinue=43842546|0|Satellite_Award_for_Best_Supporting_Actr

API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
78_Gone_with_the_Wind_(film): cannot identify image file <_io.BytesIO object at 0x7f98a088cb80>
79
80
81
82
82_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x7f98a140d040>
83
84
85
86
87
88
89
90
90_On_the_Waterfront: cannot identify image file <_io.BytesIO object at 0x7f98a0b8e220>
91
91_The_Wages_of_Fear: cannot identify image file <_io.BytesIO object at 0x7f989b7669f0>
92
93
94
94_The_Grapes_of_Wrath_(film): cannot identify image file <_io.BytesIO object at 0x7f98a0a38c20>
95
95_Roman_Holiday: cannot identify image file <_io.BytesIO object at 0x7f989b9114f0>
96
97
98
99
100
100_Battleship_Potemkin: cannot identify image file <_io.BytesIO object at 0x7f98a08a1f40>
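
The two `invalidtitle` errors in the output above both involve titles containing `%27`, the percent-encoded form of an apostrophe, whereas `Singin'_in_the_Rain`, which uses a literal apostrophe, gets past the title lookup. One possible fix, sketched here with the standard library, is to decode titles copied from a URL before passing them to wptools. Whether this resolves every title in the list is an assumption that would need checking.

```python
from urllib.parse import unquote

# Titles as they appear at the end of a Wikipedia URL.
url_titles = [
    "A_Hard_Day%27s_Night_(film)",
    "Rosemary%27s_Baby_(film)",
    "Citizen_Kane",
]

# unquote turns percent-escapes such as %27 back into the characters they stand for.
decoded_titles = [unquote(title) for title in url_titles]

print(decoded_titles)
```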


In [27]:
for key in image_errors.keys():
    print(key)

2_Citizen_Kane
3_The_Third_Man
6_The_Cabinet_of_Dr._Caligari
7_All_About_Eve
10_Metropolis_(1927_film)
13_It_Happened_One_Night
14_Singin'_in_the_Rain
15_Boyhood_(film)
18_Psycho_(1960_film)
19_Laura_(1944_film)
20_Nosferatu
22_A_Hard_Day%27s_Night_(film)
24_North_by_Northwest
27_The_Maltese_Falcon_(1941_film)
28_Repulsion_(film)
31_Sunset_Boulevard_(film)
32_King_Kong_(1933_film)
33_Spotlight_(film)
34_The_Adventures_of_Robin_Hood
35_Rashomon
40_Argo_(2012_film)
43_Bride_of_Frankenstein
45_M_(1931_film)
47_The_Philadelphia_Story_(film)
50_Seven_Samurai
51_The_Treasure_of_the_Sierra_Madre_(film)
53_12_Angry_Men_(1957_film)
54_The_400_Blows
56_All_Quiet_on_the_Western_Front_(1930_film)
57_Army_of_Shadows
60_A_Streetcar_Named_Desire_(1951_film)
61_The_Night_of_the_Hunter_(film)
65_Frankenstein_(1931_film)
66_Vertigo_(film)
68_Touch_of_Evil
69_The_Babadook
71_Rebecca_(1940_film)
72_Rosemary%27s_Baby_(film)
76_The_39_Steps_(1935_film)
82_Tokyo_Story
88_High_Noon
90_On_the_Waterfront
92_The

In [6]:
# Inspect unidentifiable images and download them individually
for rank_title, images in image_errors.items():
    if rank_title == '22_A_Hard_Day%27s_Night_(film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg'
    if rank_title == '53_12_Angry_Men_(1957_film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/9/91/12_angry_men.jpg'
    if rank_title == '72_Rosemary%27s_Baby_(film)':
        url = 'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg'
    if rank_title == '93_Harry_Potter_and_the_Deathly_Hallows_–_Part_2':
        url = 'https://upload.wikimedia.org/wikipedia/en/d/df/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2.jpg'
    title = rank_title[3:]
    df_list.append({'ranking': int(title_list.index(title) + 1),
                    'title': title,
                    'poster_url': url})
    r = requests.get(url)
    # Download movie poster image
    i = Image.open(BytesIO(r.content))
    image_file_format = url.split('.')[-1]
    i.save(folder_name + "/" + rank_title + '.' + image_file_format)

ValueError: 'he_Wizard_of_Oz_(1939_film)' is not in list
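
The `ValueError` above comes from `rank_title[3:]`, which assumes a two-digit ranking followed by an underscore. For `1_The_Wizard_of_Oz_(1939_film)` it also cuts off the first letter of the title. The same cell also reuses `url` from a previous iteration for any key that none of the `if` branches match. A sketch of a more robust way to separate the ranking from the title, using a hypothetical helper named `split_rank_title`:

```python
def split_rank_title(rank_title):
    """Split 'ranking_title' at the first underscore only."""
    ranking, title = rank_title.split("_", 1)
    return int(ranking), title


for key in ["1_The_Wizard_of_Oz_(1939_film)",
            "22_A_Hard_Day%27s_Night_(film)",
            "100_Battleship_Potemkin"]:
    print(split_rank_title(key))
```

Because the split happens at the first underscore, it works for one-, two- and three-digit rankings alike.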

In [9]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns=['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

|   | ranking | title | poster_url |
|---|---|---|---|
| 0 | 4 | Get_Out_(film) | https://upload.wikimedia.org/wikipedia/en/a/a3... |
| 1 | 5 | Mad_Max:_Fury_Road | https://upload.wikimedia.org/wikipedia/en/6/6e... |
| 2 | 6 | The_Cabinet_of_Dr._Caligari | https://upload.wikimedia.org/wikipedia/en/2/2f... |
| 3 | 8 | Inside_Out_(2015_film) | https://upload.wikimedia.org/wikipedia/en/0/0a... |
| 4 | 9 | The_Godfather | https://upload.wikimedia.org/wikipedia/en/1/1c... |
| ... | ... | ... | ... |
| 57 | 93 | Harry_Potter_and_the_Deathly_Hallows_–_Part_2 | https://upload.wikimedia.org/wikipedia/en/d/df... |
| 58 | 96 | Man_on_Wire | https://upload.wikimedia.org/wikipedia/en/5/54... |
| 59 | 97 | Jaws_(film) | https://upload.wikimedia.org/wikipedia/en/e/eb... |
| 60 | 98 | Toy_Story | https://upload.wikimedia.org/wikipedia/en/1/13... |


## Roger Ebert Word Clouds
- We've now gathered the data needed to produce our second goal visualization, the Roger Ebert review word cloud

- Next we would have to assess and clean the data, but in this notebook we focus solely on gathering
- The word clouds have been pre-gathered in the word clouds folder
- These word clouds required gathering data from two different sources: downloading files from the internet (the Roger Ebert review text files) and accessing data from an API (the movie poster URLs). This data came in two formats: `.txt` and JSON
- Data visualization can be informative but it can also be art

## Storing Data
- Storing data usually comes after cleaning, but it is not always done, which is why it is not counted as part of the data wrangling process
- Sometimes you just analyze and visualize and leave it at that without saving your new data

- Imagine you've assessed and cleaned your data, which includes merging all these separate pieces of data. What do you want to do next?
- There are two popular options: saving to a database and saving to a file. For tabular data like we have here, both flat files (such as CSV files) and databases are used a ton. The right solution depends on your dataset and the infrastructure you use in your daily work
- Saving to a CSV file is easy and is probably the best solution for a simple dataset like this one
- The `to_csv` DataFrame method is all you need, and the only parameter required to save a file on your computer is the file path you want to save to. Specifying `index=False` is often necessary too, if you don't want the DataFrame index showing up as a column in your stored dataset
    ```python
    df.to_csv('dataset.csv', index=False)
    ```
- Saving data in a database has advantages, though: databases are fast, they scale well as your data grows in size, and you can ask them questions using languages like SQL. They are very powerful, and database skills are hugely in demand in today's workplace

## Relational Database Structure
- A database is an organized collection of data that is structured to facilitate the storage, retrieval, modification and deletion of data
- There are two main types of databases: relational and non-relational, with relational being the most popular
- Structured Query Language (SQL) is the standard language for communicating with relational databases

### Why do Data Analysts use Relational Databases and SQL?
- A lot of the world's data lives in databases, and most of the world's databases are accessed using SQL
- SQL is the most common method for accessing data in databases today. It has a variety of functions that allow its users to read, manipulate and change data

- It's popular for data analysis because:
    * It's easy to understand and learn
    * It can access data directly from where it's stored
    * It's easy to audit and replicate
    * It can run queries on multiple tables at once, across large datasets
    * It can answer complex questions with more detail and depth compared to other analytic tools

### Why do Businesses choose relational databases and SQL
- Nearly all applications need to store data so that it can be accessed later, and SQL is the language that allows analysts and others to access that information
- Databases have a number of attributes that make them great for this:
    * They check for data integrity to make sure that data is entered in the appropriate format
    * They are fast across large datasets and can be optimized for greater speed
    * They are shared entities meaning many people can access the same data at the same time
    * They have administrative features like access controls

- [Cornell: Relational Databases - Not your Father's Flat Files](https://www.cac.cornell.edu/education/Training/DataAnalysis/RelationalDatabases.pdf)

### How Relational Databases Store Data
- If you've used Excel you should already be familiar with tables; they are similar to spreadsheets. Tables have rows and columns just like Excel, but with some more rigid rules

- Database tables, for instance, are organized by column:
    * Each column must have a unique name
    * In a spreadsheet each cell can have its own data type, but in a database table all the data in a column must be of the same type
    * Descriptive column names are important
- While the data type must be consistent, the database doesn't necessarily know that a number is a latitude or that a piece of text is the name of a company. That's why descriptive column names are important

### Types of SQL statements
- SQL has a few basic elements, the most basic of which is the statement:
    * statement - think of it as a piece of correctly written SQL code that tells the database what you want to do
    * `CREATE TABLE` - creates a new table in the database
    * `DROP TABLE` - removes a table from the database
    * `SELECT` - allows you to read and display data; `SELECT` statements are also known as queries

    ##### `SELECT` and `FROM`
    - Every query starts with two mandatory clauses, or command words, each of which answers an important question:
        * `FROM` - What data do you want to pull from? Which table do you want to use?
        * `SELECT` - Which elements from the database do you want to pull? Which columns do you want from the table?
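
The statements above can be tried out without installing anything, using Python's built-in `sqlite3` module and an in-memory database. This is a minimal sketch: the table name, column names and rows are invented for illustration, and an `INSERT` statement (not covered in these notes) is used only to have something to query.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# CREATE TABLE: define a table with named, typed columns.
conn.execute("CREATE TABLE movies (ranking INTEGER, title TEXT, rating INTEGER)")

# INSERT is used here only to have some rows to query.
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Example Film A", 99), (2, "Example Film B", 97)],
)

# SELECT names the columns to return; FROM names the table to read.
rows = conn.execute("SELECT title, rating FROM movies").fetchall()
print(rows)

# DROP TABLE: remove the table from the database.
conn.execute("DROP TABLE movies")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining)

conn.close()
```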


## Relational databases in Python
### Data Wrangling and Relational Databases
- In the context of data wrangling, databases and SQL should only come into play for gathering and/or storing data. That is:
    * __1. Connecting to a database and importing data__ into a pandas DataFrame, then assessing and cleaning that data
    * __2. Connecting to a database and storing data__ you just gathered (potentially from a database), assessed and cleaned

- These tasks are especially necessary when you have large amounts of data, which is where SQL databases excel over flat files
- The two scenarios above can be further broken down into three main tasks:
    * __1. Connecting to a database in Python__
    * __2. Storing data from a pandas DataFrame in a database to which you are connected__
    * __3. Importing data from a database to which you are connected into a pandas DataFrame__

- In the steps below we are going to:

    * Connect to a database. We'll connect to a SQLite database using [SQLAlchemy](https://www.sqlalchemy.org/), a database toolkit for Python
    * Store the data in the cleaned master dataset in the database. We'll do this using the pandas `to_sql` DataFrame method
    * Then read the brand new data in that database back into a pandas DataFrame. We'll do this using the pandas `read_sql` function.

### Relational databases and pandas

In [10]:
import pandas as pd

In [12]:
df = pd.read_csv(
    '/home/mark/data_science/Udacity/data_analysis/2-introduction_to_data-analysis/case_studies/second/fuel-economy-datasets/all_alpha_08_clean.csv')


In [13]:
df.head(3)

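|   | model | displ | cyl | trans | drive | fuel | cert_region | veh_class | air_pollution_score | city_mpg | hwy_mpg | cmb_mpg | greenhouse_gas_score | smartway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | CA | SUV | 7 | 15 | 20 | 17 | 4 | no |
| 1 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | FA | SUV | 6 | 15 | 20 | 17 | 4 | no |
| 2 | ACURA RDX | 2.3 | (4 cyl) | Auto-S5 | 4WD | Gasoline | CA | SUV | 7 | 17 | 22 | 19 | 5 | no |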


### Connect to a database

In [19]:
from sqlalchemy import create_engine

In [20]:
# Create a SQLAlchemy Engine for the SQLite database file all_alpha_08_clean.db
# all_alpha_08_clean.db will not show up in the Jupyter Notebook dashboard yet
engine = create_engine('sqlite:///all_alpha_08_clean.db')

### Store pandas DataFrame in a database
- Store the cleaned DataFrame in the database (the lesson stores the cleaned master dataset, bestofrt_master; here we store the fuel economy dataset loaded above)

In [21]:
# Store cleaned DataFrame ('df') in a table called master in all_alpha_08_clean.db
# all_alpha_08_clean.db will be visible now in the Jupyter Notebook dashboard
df.to_sql('master', engine, index=False, if_exists='replace')

### Read database data into a pandas DataFrame
- Read the brand new data in that database back into a pandas DataFrame

In [22]:
df_gather = pd.read_sql('SELECT * FROM master', engine)

In [23]:
df_gather.head(3)

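|   | model | displ | cyl | trans | drive | fuel | cert_region | veh_class | air_pollution_score | city_mpg | hwy_mpg | cmb_mpg | greenhouse_gas_score | smartway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | CA | SUV | 7 | 15 | 20 | 17 | 4 | no |
| 1 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | FA | SUV | 6 | 15 | 20 | 17 | 4 | no |
| 2 | ACURA RDX | 2.3 | (4 cyl) | Auto-S5 | 4WD | Gasoline | CA | SUV | 7 | 17 | 22 | 19 | 5 | no |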


## Data Wrangling in SQL?
- Data wrangling can actually be performed in SQL
- pandas is better equipped for gathering (where it has a huge simplicity advantage), assessing and cleaning data, so it's recommended to use pandas

- [Reddit thread that debates pandas vs SQL](https://www.reddit.com/r/Python/comments/1tqjt4/why_do_you_use_pandas_instead_of_sql/)
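
To give a feel for what wrangling in SQL looks like, the sketch below cleans a column shaped like the `cyl` values shown earlier (`(6 cyl)`) into integers, once in SQL and once in Python. It is an illustration under assumptions: the cleaning step is a hypothetical one, only two columns are recreated in an in-memory SQLite table, and plain Python stands in for pandas to keep the example dependency-free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (model TEXT, cyl TEXT)")
conn.executemany(
    "INSERT INTO master VALUES (?, ?)",
    [("ACURA MDX", "(6 cyl)"), ("ACURA RDX", "(4 cyl)")],
)

# Cleaning in SQL: strip the surrounding text, then cast to an integer.
sql_cleaned = conn.execute(
    """
    SELECT model,
           CAST(REPLACE(REPLACE(cyl, '(', ''), ' cyl)', '') AS INTEGER) AS cyl
    FROM master
    ORDER BY model
    """
).fetchall()

# The same cleaning in plain Python, for comparison.
raw_rows = conn.execute("SELECT model, cyl FROM master ORDER BY model").fetchall()
py_cleaned = [(model, int(cyl.strip("()").split()[0])) for model, cyl in raw_rows]

print(sql_cleaned)
print(py_cleaned)
conn.close()
```

Both routes give the same result. The SQL version needs nested string functions for a fix that takes a single expression in Python.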