# Outline
**Gathering** data is the first step in data wrangling. Before gathering, we have no data, and after it, we do.

Gathering data varies from project to project. Sometimes you're just given data, or pointed to it like I've done for you throughout this course. Sometimes you need to search for the right data for your project. Sometimes the data you need isn't readily available, and you need to generate it yourself somehow. When you do find your data, it's not unusual for it to be spread across several different sources and file formats, which makes things tricky when organizing the data in your programming environment.

For these reasons and more, gathering can be tricky. In this lesson, which is likely the most technically challenging lesson of the course, you'll acquire the coding skills and general craftiness required to conquer the vast majority of gathering scenarios you'll come across in the future. This is going to be hard sometimes, and that's okay. Stick with it and don't hesitate to reach out for help.

**This lesson will be structured as follows:**

* First, we'll pose a few questions.
* Then you'll explore the source of each piece of data we need to answer those questions, each piece from a different source and in a different format.
* Then you'll learn about the structure of each file format.
* Then you'll learn how to handle that file format using Python and its libraries.
* Then you'll actually gather each piece of data to later join together to create your master dataset.

# Navigating Your Working Directory and File I/O
Before you continue with this lesson, make sure you are comfortable using your computer's command line interface to access files and folders, and with reading from and writing to files in Python (file I/O, short for input/output). It can be extremely frustrating to get bogged down in these seemingly trivial topics.

## Command Line
For the command line interface, here are three excellent resources that I recommend. Pick whichever suits you best:

* Our short [Linux Command Line Basics](https://www.udacity.com/course/linux-command-line-basics--ud595) course (for Linux and Mac users)
* [Navigating the Terminal: A Gentle Introduction](https://computers.tutsplus.com/tutorials/navigating-the-terminal-a-gentle-introduction--mac-3855) by Marius Masalar (for Mac users)
* [Command Prompt - How to use the simple, basic commands](http://www.digitalcitizen.life/command-prompt-how-use-basic-commands) by Codrut Neagu (for Windows users)

## File I/O
For reading from and writing to files in Python:

* The ["Reading and Writing Files"](https://classroom.udacity.com/nanodegrees/nd002/parts/762c0200-e8a7-425b-be49-7080cc533c7d/modules/d2268785-db9d-4aaa-ab44-afec79099d7d/lessons/62fec647-9f0e-4551-8752-2139e2d4eb5f/concepts/43991399-3df7-48cf-a10c-792921e1b6bf) concept in Lesson 6 ("Scripting") of our [Prerequisite: Python](https://classroom.udacity.com/nanodegrees/nd002/parts/762c0200-e8a7-425b-be49-7080cc533c7d) course found in the Extracurricular section

Feel free to skip these resources and continue with this lesson if you're familiar already.

# Flat File Structure
Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
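
To make the record-per-line, delimiter-separated structure concrete, here is a minimal sketch that parses a small flat file with Python's built-in `csv` module. The file contents, the column names, and the `read_flat_file` helper are made-up placeholders for illustration, not data from this lesson.

```Python
import csv
import io

# Hypothetical flat-file contents: a header row, then one record per line,
# with fields separated by commas. The values are placeholders.
flat_file_text = (
    "title,year,score\n"
    "Movie A,2001,90\n"
    "Movie B,2002,85\n"
)


def read_flat_file(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


records = read_flat_file(flat_file_text)
for record in records:
    # Every field comes back as a string; converting types is up to you.
    print(record["title"], record["year"], record["score"])
```

Passing `delimiter="\t"` would read a tab-separated (TSV) file the same way.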

**Advantages of flat files** include:

* They're text files and therefore human readable.
* Lightweight.
* Simple to understand.
* Software that can read/write text files is ubiquitous, like text editors.
* Great for small datasets.

**Disadvantages of flat files**, in comparison to relational databases, for example, include:

* Lack of standards.
* Data redundancy.
* Sharing data can be cumbersome.
* Not great for large datasets (see "When does small become large?" in the Cornell link in More Information).

> More Information
> * [Professor Excel: XML & ZIP: Explore Your Excel Workbooks File Structure](http://professor-excel.com/xml-zip-excel-file-structure/)
> * [Cornell: Relational Databases - Not your Father's Flat Files](https://www.cac.cornell.edu/education/Training/DataAnalysis/RelationalDatabases.pdf)

# Source: Web Scraping
The two main ways to work with HTML files are:

* Saving the HTML file to your computer (using the [Requests](http://docs.python-requests.org/en/master/) library, for example) and reading that file into a `BeautifulSoup` constructor
* Reading the HTML response content directly into a `BeautifulSoup` constructor (again using the Requests library for example)
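
Here is a rough sketch of both approaches. To keep it runnable offline, a made-up bytes string stands in for the `content` attribute of a Requests response, and the file name and temporary folder are assumptions for illustration only.

```Python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in for `requests.get(url).content`, which is also a bytes object.
# This HTML is a made-up placeholder, not a real Rotten Tomatoes page.
response_content = (
    b"<html><head><title>Example Movie</title></head><body></body></html>"
)

# Way 1: save the HTML to disk, then read the saved file into BeautifulSoup.
folder = tempfile.mkdtemp()
html_path = os.path.join(folder, "example_movie.html")
with open(html_path, "wb") as f:
    f.write(response_content)
with open(html_path, "rb") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

# Way 2: pass the response content straight to the constructor.
soup_from_response = BeautifulSoup(response_content, "html.parser")

print(soup_from_file.title.string)
print(soup_from_response.title.string)
```

Both routes end with an equivalent `BeautifulSoup` object; the first one also leaves a copy of the page on disk that you can reuse without hitting the website again.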

You'll learn how this Requests code works under the hood shortly in “Downloading Files from The Internet.”

For this lesson, you’re going to do neither of these. I've downloaded all of the Rotten Tomatoes HTML files for you and put them in a folder called rt_html in the Jupyter Notebooks in the Udacity classroom. If you want to work outside of the classroom, **download [this zip file](https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ca6b7b_rt-html/rt-html.zip) and extract the rt_html folder**. I recommend that you do this either way, and that you open the HTML files in your preferred text editor (e.g. [Sublime](https://www.sublimetext.com/), which is free) to inspect the HTML for the quizzes ahead.

The rt_html folder contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time as the list stood at the most recent update of this lesson. I'm giving you these historical files because the ratings will change over time and there will be inconsistencies with the recorded lesson videos. Also, a web page's HTML is known to change over time. Scraping code can break easily when web redesigns occur, which makes scraping brittle and not recommended for projects with longevity. So just use these HTML files provided to you and pretend like you saved them yourself with one of the methods described above.

> More Information
> * [Towards Data Science: Ethics in Web Scraping](https://medium.com/towards-data-science/ethics-in-web-scraping-b96b18136f01)
> * [David Venturi: Screen scraping was the first "magical" thing that drew me to programming](https://twitter.com/venturidb/status/734757220525715456)

# HTML File Structure
The Hypertext Markup Language (or HTML) is the language used to create documents for the World Wide Web.

Let's turn to [Cameron Pittman](https://blog.udacity.com/2015/03/3-web-developers-built-careers-scratch-part-two-cameron-pittman.html), an instructor and Full Stack Engineer at Udacity, to introduce the basic structure of HTML files. The three short videos below are all you need to know to start web scraping. If you'd like to learn more, or are feeling like there are knowledge gaps you'd like to fill in, I encourage you to check out Cameron's "Intro to HTML and CSS" course. You can find it [here](https://www.udacity.com/course/intro-to-html-and-css--ud304).

As you're following along with the video, open up one of the HTML files you just downloaded in a text editor (like Sublime) and look for similarities in the HTML document that Cameron uses as an example. We'll do this together soon.

# Source: Downloading Files from the Internet
## HTTP (Hypertext Transfer Protocol)
HTTP, the Hypertext Transfer Protocol, is the language that web browsers (like Chrome or Safari) and web servers (basically computers where the contents of a website are stored) speak to each other. Every time you open a web page, or download a file, or watch a video, it's HTTP that makes it possible.

HTTP is a request/response protocol:

* Your computer, a.k.a. the client, sends a request to a server for some file. For this lesson: "Get me the file **1-the-wizard-of-oz-1939-film.txt**", for example. GET is the name of the HTTP request method (of which there are multiple) used for retrieving data.
* The web server sends back a response. If the request is valid: "Here is the file you asked for:", then followed by the contents of the **1-the-wizard-of-oz-1939-film.txt** file itself.
![img](./assets/l2_12.png)
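
The request/response exchange described above can be sketched end to end with nothing but the standard library. This example starts a throwaway web server on your own machine and sends it a GET request, so it needs no internet access. The handler class, the helper function, and the file contents are assumptions made up for this sketch.

```Python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

FILE_NAME = "1-the-wizard-of-oz-1939-film.txt"
FILE_BODY = b"placeholder review text"  # made-up contents for illustration


class OneFileHandler(BaseHTTPRequestHandler):
    """A toy web server that knows about exactly one file."""

    def do_GET(self):
        if self.path == "/" + FILE_NAME:
            # Valid request: status 200, headers, then the file contents.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(FILE_BODY)))
            self.end_headers()
            self.wfile.write(FILE_BODY)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_from_local_server():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), OneFileHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # The client side: send a GET request and read the response.
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/" + FILE_NAME)
        response = conn.getresponse()
        result = (response.status, response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_from_local_server()
print(status, body)
```

In practice you would rarely write the server side yourself; the point is only to show that a GET request names a resource and the response carries a status code plus the bytes of that resource.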

If you'd like to learn more, or are feeling like there are knowledge gaps you'd like to fill in, I encourage you to check out the videos in Lesson 1 ("How the Web Works") of our free [Web Development course](https://classroom.udacity.com/courses/cs253).

# Text File Structure
## Encodings and Character Sets Articles
[The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) by Joel Spolsky

An excerpt:

> **The Single Most Important Fact About Encodings**

> If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

> **There Ain’t No Such Thing As Plain Text**

> If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

> Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

[What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/)

> An article by Joel Spolsky entitled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a nice introduction to the topic and I greatly enjoy reading it every once in a while. I hesitate to refer people to it who have trouble understanding encoding problems though since, while entertaining, it is pretty light on actual technical details. I hope this article can shed some more light on what exactly an encoding is and just why all your text screws up when you least need it.

> Any character can be encoded in many different bit sequences and any particular bit sequence can represent many different characters, depending on which encoding is used to read or write them. The reason is simply because different encodings use different numbers of bits per characters and different values to represent different characters.

## Unicode and Python
In Python 3, there is:

* one text type: `str`, which holds Unicode data and
* two byte types: `bytes` and `bytearray`
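
A short sketch of how these types relate, using a sample word chosen for illustration. Encoding turns a `str` into `bytes`, decoding goes the other way, and decoding with the wrong encoding is exactly how text turns into gibberish.

```Python
text = "café"                           # str: a sequence of Unicode code points
utf8_bytes = text.encode("utf-8")       # bytes: the text encoded as UTF-8
latin1_bytes = text.encode("latin-1")   # the same text in a different encoding

print(utf8_bytes)    # b'caf\xc3\xa9'  (the accented letter takes two bytes)
print(latin1_bytes)  # b'caf\xe9'      (here it takes one byte)

# Decoding with the encoding that was used to write the bytes round-trips.
same_text = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding raises no error here; it silently garbles.
garbled = utf8_bytes.decode("latin-1")
print(garbled)       # cafÃ©

# bytearray is the mutable counterpart of bytes.
buffer = bytearray(utf8_bytes)
buffer[0:1] = b"C"
print(buffer.decode("utf-8"))  # Café
```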

The Stack Overflow answers [here](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string) explain the different use cases well.

## More Information
* If you’re still confused about the difference between character sets and encoding, check out these articles:
    * [The difference between UTF-8 and Unicode?](http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/)
    * [More About Unicode in Python 2 and 3](http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/)


# Text Files in Python
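
As a minimal sketch of the pattern the links below discuss, the example here uses `glob` to find every text file in a folder, opens each one with a `with` statement and an explicit encoding, and reads only the first line by treating the file object as an iterator. The folder, the file names, and their contents are throwaway placeholders created by the example itself.

```Python
import glob
import os
import tempfile

# Set up a throwaway folder with two small text files (made-up contents).
folder = tempfile.mkdtemp()
samples = [("a.txt", ["Title A", "Body A"]), ("b.txt", ["Title B", "Body B"])]
for name, lines in samples:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

first_lines = {}
# glob.glob returns the paths that match the wildcard pattern.
for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
    # `with` closes the file even if an error occurs inside the block.
    with open(path, encoding="utf-8") as f:
        # A file object is an iterator over its lines; next() reads one line.
        first_lines[os.path.basename(path)] = next(f).rstrip("\n")

print(first_lines)
```
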
> More Information
> * [Stack Overflow: Best Practices for Opening Files in Python](https://stackoverflow.com/a/22288895)
> * [Stack Overflow: The Correct, Fully Pythonic Way to Read a File](https://stackoverflow.com/a/8010133)
> * [Stack Overflow: Iterables and Iterators](https://stackoverflow.com/a/16994568)
> * [Wikipedia: Glob programming](https://en.wikipedia.org/wiki/Glob_(programming))

# Source: APIs (Application Programming Interfaces)
## MediaWiki API

MediaWiki has a great [tutorial](https://www.mediawiki.org/wiki/API:Tutorial) on their website on how their API calls are structured. It's a nice and simple example and they explain the various moving parts:

* The endpoint (important takeaway: there is nothing special about this URL!)
* The format
* The action
* Action-specific parameters

Go and read that example and then come back to the classroom.

Done reading? Great! Though they say that is a "simple example," it could definitely be simpler! This is where access libraries, also known as client libraries or even just libraries (as in "Twitter API libraries"), come into play and make our lives easier.
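
For comparison, here is roughly what assembling a MediaWiki API call by hand looks like. This sketch only builds the request URL from the moving parts listed above and does not send it. The page title and the choice of `prop=images` are assumptions picked for illustration.

```Python
from urllib.parse import parse_qs, urlencode, urlsplit

# The endpoint: an ordinary URL (English Wikipedia's, in this sketch).
endpoint = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",           # the action
    "format": "json",            # the format
    "titles": "Mahatma_Gandhi",  # action-specific parameters (illustrative)
    "prop": "images",
}

# urlencode joins the parameters into a query string and escapes them.
request_url = endpoint + "?" + urlencode(params)
print(request_url)
```

Every call needs this kind of bookkeeping, plus code to send the request and unpack the response, which is the work an access library takes off your hands.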

## wptools Library

There are a bunch of different access libraries for MediaWiki to satisfy the variety of programming languages that exist. Here is a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python. This is pretty standard for most APIs. Some libraries are better than others, which again, is standard. For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools). The analogous relationship for Twitter is:

* MediaWiki API → wptools
* Twitter API → tweepy

wptools has an even simpler tutorial on their GitHub page using the [Mahatma Gandhi Wikipedia page](https://en.wikipedia.org/wiki/Mahatma_Gandhi) as a working example.

To get a `page` object, the [usage](https://github.com/siznax/wptools/wiki/Usage#page-usage) is as follows:
```Python
page = wptools.page('Mahatma_Gandhi')
```

...where *'Mahatma_Gandhi'* is the last bit of the Wikipedia URL for that page (https://en.wikipedia.org/wiki/Mahatma_Gandhi). This `page` object has methods that can get us various pieces of data about that Wikipedia page, including all of the images on the page. To get all of the data:

>Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.
```Python
page = wptools.page('Mahatma_Gandhi').get()
```
Or if you already have a page object assigned to `page`:
```Python
page.get()
```
`page` now has the following attributes, which are stored in the `page.data` dictionary and can be accessed by key:
![img](./assets/l2_15.png)

`page.data['image']`, for example, would return a list of data for six images on this specific Wikipedia page.

# JSON File Structure
> More Information
> * [Mashery: API Data Exchange: XML vs. JSON](https://www.tibco.com/blog/2014/01/23/api-data-exchange-xml-vs-json/)

# JSON Files in Python
## More JSON in Python
For the example in this lesson, JSON data was sourced from an API. That isn't always the case, though! Sometimes you're given a text file with human-readable JSON within it. For this situation, the [*json*](http://docs.python-guide.org/en/latest/scenarios/json/) library is indispensable. It can parse JSON from strings or files into a Python dictionary or list, and it can convert Python dictionaries or lists into JSON strings. The tutorial on the linked documentation page is handy. This [Reading and Writing JSON to a File in Python](http://stackabuse.com/reading-and-writing-json-to-a-file-in-python/) article from Stack Abuse is also great; it outlines `json.dump`, `json.dumps`, `json.load`, and `json.loads` (four key json library functions) well.
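
A minimal sketch of those four functions, using a made-up dictionary. The versions ending in `s` work with strings, and the others work with file objects. An in-memory `io.StringIO` stands in for a real file here so the example leaves nothing on disk.

```Python
import io
import json

# A made-up record for illustration.
movie = {"title": "Example Movie", "genres": ["Drama", "Family"], "rating": None}

# dumps / loads: convert to and from a JSON *string*.
json_text = json.dumps(movie)
round_tripped = json.loads(json_text)

# dump / load: write to and read from a *file object*.
# io.StringIO stands in for a file opened with open(path, "w").
file_like = io.StringIO()
json.dump(movie, file_like, indent=2)
file_like.seek(0)  # rewind before reading back
from_file = json.load(file_like)

print(json_text)
print(from_file == movie)
```

Note how Python's `None` becomes JSON's `null` on the way out and turns back into `None` on the way in.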

pandas also has JSON functions (the `read_json` function and the `to_json` DataFrame method), but the hierarchical advantage of JSON is wasted in pandas' tabular DataFrame so the uses are limited.

# Mashup: APIs, Downloading Files Programmatically, JSON
With APIs, downloading files programmatically from the internet, and JSON under your belt, you now have all of the knowledge to download all of the movie poster images for the Roger Ebert review word clouds. This is your next task.

There are two key things to be aware of before you begin:

## Wikipedia Page Titles
To access Wikipedia page data via the MediaWiki API with *wptools* (*phew*, that was a mouthful), you need each movie's Wikipedia page title, i.e., what comes after the last slash in **en.wikipedia.org/wiki/** in the URL. For this lesson, I've compiled all of these titles for each of the movies in the Top 100 for you.
![img](./assets/l2_18.png)
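
If you ever need to derive a page title yourself, a small helper like the one below would do it. The function name is hypothetical, and the sketch assumes the URL follows the `/wiki/<title>` pattern described above.

```Python
from urllib.parse import urlsplit


def wikipedia_title(url):
    """Return the part of a Wikipedia URL's path that follows '/wiki/'."""
    path = urlsplit(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a /wiki/ page URL: " + url)
    # Slicing off the prefix (rather than splitting on the last slash)
    # keeps titles that themselves contain a slash intact.
    return path[len(prefix):]


title = wikipedia_title("https://en.wikipedia.org/wiki/Mahatma_Gandhi")
print(title)  # Mahatma_Gandhi
```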

## Downloading Image Files
Downloading images may seem tricky from a reading and writing perspective, in comparison to text files which you can read line by line, for example. But in reality, image files aren't special—they're just binary files. To interact with them, you don't need special software (like Photoshop or something) that "understands" images. You can use regular file opening, reading, and writing techniques, like this:
```Python
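import os
import requests
# `url`, `folder_name`, and `filename` are assumed to be defined already
r = requests.get(url)
# 'wb' opens the file for writing in binary mode, since images are bytes
with open(os.path.join(folder_name, filename), 'wb') as f:
    f.write(r.content)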
```
But this technique can be error-prone. It will work most of the time, but sometimes the file you write to will be damaged. This happened to me when preparing this lesson:
![img](./assets/l2_18_1.png)

This type of error is why the requests library maintainers [recommend](http://docs.python-requests.org/en/latest/user/quickstart/#binary-response-content) using the [PIL](https://pillow.readthedocs.io/) library (the Python Imaging Library, installed today as its maintained fork, Pillow) and `BytesIO` from the io module for non-text requests, like images. That is, they recommend accessing the response body as bytes. For example, to create an image from binary data returned by a request:
```Python
import requests
from PIL import Image
from io import BytesIO
r = requests.get(url)
i = Image.open(BytesIO(r.content))
```
Though you may still encounter a similar file error, this code above will at least warn us with an error message, at which point we can manually download the problematic images.

# Storing Data
Storing is usually done after cleaning, but it's not always done, which excludes it from being a core part of the data wrangling process. Sometimes you just analyze and visualize and leave it at that, without saving your new data.

Again, because storing is performed on cleaned data, we could cover this at the end of Lesson 4 ("Cleaning Data"). But since we're covering file formats in this lesson, let's cover it here.

Imagine you've assessed and cleaned your data, which includes merging all of these separate pieces of data (as I mentioned in the last video, I took care of that behind the scenes for you). What do you want to do next?

The advantages and disadvantages of flat files were discussed earlier in the lesson in the Flat File Structure concept. One of the advantages:

>Great for small datasets.

And one of the disadvantages:

>Sharing data can be cumbersome.

Given the size of this dataset and that it likely won't be shared often, saving to a flat file like a CSV is probably the best solution. With pandas, saving your gathered data to a CSV file is easy. The `to_csv` [DataFrame method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) is all you need and the only parameter required to save a file on your computer is the file path to which you want to save this file. Often specifying `index=False` is necessary too if you don't want the DataFrame index showing up as a column in your stored dataset. If you had a DataFrame, df, and wanted to save to a file named dataset.csv with no index column:
```Python
df.to_csv('dataset.csv', index=False)
```
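
To see what `index=False` changes, this small sketch compares the header row produced with and without it. It relies on `to_csv` returning the CSV text as a string when no file path is given. The DataFrame contents are made up.

```Python
import pandas as pd

# A made-up one-row DataFrame for illustration.
df = pd.DataFrame({"title": ["Example Movie"], "score": [90]})

with_index = df.to_csv()                # no path given: returns the CSV text
without_index = df.to_csv(index=False)

# With the index, an extra unnamed first column appears.
print(with_index.splitlines()[0])       # ,title,score
print(without_index.splitlines()[0])    # title,score
```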

# Relational Database Structure
A database is an organized collection of data that is structured to facilitate the storage, retrieval, modification, and deletion of data. There are two main types of databases: relational databases and non-relational databases, with relational being the most popular. SQL, or Structured Query Language, is the standard language for communicating with relational databases.
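
As a quick taste of SQL before the videos, here is a minimal sketch that stores, retrieves, modifies, and deletes rows in a relational database. It uses Python's built-in `sqlite3` module with a temporary in-memory database. The table name, column names, and rows are placeholders invented for this example.

```Python
import sqlite3

# ":memory:" creates a temporary database that disappears when closed.
conn = sqlite3.connect(":memory:")

# Storage: define a table, then insert some made-up rows.
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Movie A", 1939), ("Movie B", 1941), ("Movie C", 1950)],
)

# Retrieval: ask for only the rows and columns you need.
rows = conn.execute(
    "SELECT title FROM movies WHERE year < 1945 ORDER BY year"
).fetchall()
print(rows)

# Modification and deletion.
conn.execute("UPDATE movies SET year = 1951 WHERE title = 'Movie C'")
conn.execute("DELETE FROM movies WHERE title = 'Movie B'")

remaining = conn.execute(
    "SELECT title, year FROM movies ORDER BY year"
).fetchall()
print(remaining)
conn.close()
```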

Let’s turn to [Derek Steer](https://twitter.com/dereksteer), co-founder and CEO of [Mode Analytics](https://modeanalytics.com/) (a company that is building software for SQL-based data analysis), to introduce the basic structure of relational databases, their advantages and disadvantages, and how you can interact with them using SQL. The ~5 minutes of videos and text below are all you need to know for this lesson.

Databases and SQL are topics that deserve a full course. If you'd like to learn more, enroll in the [Data Foundations Nanodegree](https://www.udacity.com/course/data-foundations-nanodegree--nd100) program or the [Data Analyst Nanodegree program](https://www.udacity.com/course/data-analyst-nanodegree--nd002) (if you aren't already) to access Derek's SQL for Data Analysis course.

I've selected a subset of videos from that course most relevant to Data Wrangling for you to preview here. As you’re following along with the video, imagine how the Rotten Tomatoes master dataset would be represented and how you might query it to get information. There is also a SQL Explorer Workspace following the videos where you can make queries on the same PostgreSQL database that Derek mentions.

> More Information
> * [Cornell: Relational Databases - Not your Father’s Flat Files](https://www.cac.cornell.edu/education/Training/DataAnalysis/RelationalDatabases.pdf)

# Relational Databases in Python
## Data Wrangling and Relational Databases
In the context of data wrangling, we recommend that databases and SQL only come into play for gathering data or storing data. That is:

* **Connecting to a database and importing data** into a pandas DataFrame (or the analogous data structure in your preferred programming language), then assessing and cleaning that data, or
* **Connecting to a database and storing data** you just gathered (which could potentially be from a database), assessed, and cleaned

These tasks are especially necessary when you have large amounts of data, which is where relational (SQL) databases and other databases excel over flat files.

The two scenarios above can be further broken down into three main tasks:

* Connecting to a database in Python
* Storing data **from** a pandas DataFrame **in** a database to which you're connected, and
* Importing data **from** a database to which you're connected **to** a pandas DataFrame

## This Lesson
For the example in this lesson, we're going to do these in order:

1. Connect to a database. We'll connect to a SQLite database using [SQLAlchemy](https://www.sqlalchemy.org/), a database toolkit for Python.
1. Store the data in the cleaned master dataset in that database. We'll do this using pandas' `to_sql` DataFrame method.
1. Then read the brand new data in that database back into a pandas DataFrame. We'll do this using pandas' `read_sql` function.

The third one isn’t necessary for this lesson, but often in the workplace, instead of having to download files, scrape web pages, hit an API, etc., you're given a database right at the beginning of a project.

All three of these tasks will be introduced and carried out in the Jupyter Notebook below. These are not quizzes. All of the code is provided for you. Your job is to read and understand each comment and line of code, then run the code.
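
For readers working outside the classroom, the three steps listed above can be sketched as follows. Note the swap: to keep this example dependent only on pandas and the standard library, it connects with the built-in `sqlite3` module, which pandas also accepts for SQLite databases, rather than with a SQLAlchemy engine as the notebook does. The table name `master` and the DataFrame contents are made-up placeholders.

```Python
import sqlite3

import pandas as pd

# A made-up stand-in for the cleaned master dataset.
df_clean = pd.DataFrame({"title": ["Movie A", "Movie B"], "score": [90, 85]})

# 1. Connect to a database (an in-memory SQLite database in this sketch).
conn = sqlite3.connect(":memory:")

# 2. Store the DataFrame in a table named "master" (a hypothetical name).
df_clean.to_sql("master", conn, index=False)

# 3. Read the stored data back into a new DataFrame.
df_from_db = pd.read_sql("SELECT * FROM master", conn)
conn.close()

print(df_from_db)
```

With SQLAlchemy, the connection step would instead create an engine and pass it to `to_sql` and `read_sql` in place of `conn`.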

## Data Wrangling in SQL?
Data wrangling can actually be performed in SQL. We believe that pandas is better equipped for gathering (pandas has a huge simplicity advantage in this area), assessing, and cleaning data, so we usually recommend that you use pandas if given the choice. If wrangling in a work setting, sometimes your tool of choice for data wrangling depends on your company infrastructure, though.

Here is an interesting [Reddit thread that debates pandas vs. SQL](https://www.reddit.com/r/Python/comments/1tqjt4/why_do_you_use_pandas_instead_of_sql/) in general and touches on several topics related to data wrangling.

# Other File Formats
The types of files you mastered in this lesson are the ones you'll interact with for the vast majority of your wrangling projects in the future. Again, these were:

* Flat files (e.g. CSV and TSV)
* HTML files
* JSON files
* TXT files
* Relational database files

Additional, less common file formats include:

* [Excel files](https://www.lifewire.com/what-is-an-xlsx-file-2622540)
* [Pickle files](https://stackoverflow.com/questions/7501947/understanding-pickling-in-python)
* [HDF5 files](http://neondataskills.org/HDF5/About)
* [SAS files](http://whatis.techtarget.com/fileformat/SAS-SAS-program-file)
* [STATA files](http://faculty.econ.ucdavis.edu/faculty/cameron/stata/stataintro.html)

pandas has [functions](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output) to read these files (and to write to most of them). Also, you now have the foundational understanding of **gathering** and file formats in general, so learning these additional formats won't be too hard if you need them.