Lesson Outline
Data wrangling process:

>Gather (this lesson)  
Assess  
Clean  

__Gathering data__ is the first step in data wrangling. Before gathering, we have no data, and after it, we do.

Gathering data varies from project to project. Sometimes you're simply given data or pointed to it, as I do for you throughout this course. Sometimes you need to search for the right data for your project. Sometimes the data you need isn't readily available, and you need to generate it yourself. When you do find your data, it's not unusual for it to be spread across several different sources and file formats, which makes it tricky to organize in your programming environment.

For these reasons and more, gathering can be tricky. In this lesson, which is likely the most technically challenging lesson of the course, you'll acquire the coding skills and general craftiness required to conquer the vast majority of gathering scenarios you'll come across in the future. This is going to be hard sometimes, and that's okay. Stick with it and don't hesitate to reach out for help.

This lesson will be structured as follows:

- First, we'll pose a few questions.
- Then you'll explore the source of each piece of data we need to answer those questions, each piece from a different source and in a different format.
- Then you'll learn about the structure of each file format.
- Then you'll learn how to handle that file format using Python and its libraries.
- Then you'll actually gather each piece of data to later join together to create your master dataset.

In [1]:
import pandas as pd
import numpy as np

## Flat File Structure
Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
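To make the role of the delimiter concrete, here is a minimal sketch that parses the same two records written with a comma, a tab, and a colon. It uses Python's standard `csv` module rather than pandas so it runs without any files on disk. The `samples` strings and the `parse_flat_file` helper are made up for illustration, and the records are cut down to two fields.

```python
import csv
import io

# Hypothetical sample data: the same two records, one per line,
# written three times with a different delimiter each time.
samples = {
    ",": "ranking,title\n1,The Wizard of Oz (1939)\n2,Citizen Kane (1941)\n",
    "\t": "ranking\ttitle\n1\tThe Wizard of Oz (1939)\n2\tCitizen Kane (1941)\n",
    ":": "ranking:title\n1:The Wizard of Oz (1939)\n2:Citizen Kane (1941)\n",
}


def parse_flat_file(text, delimiter):
    """Return the records in `text` as a list of dicts keyed by the header fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# Parse each sample with its own delimiter.
parsed = {delimiter: parse_flat_file(text, delimiter)
          for delimiter, text in samples.items()}

for delimiter, rows in parsed.items():
    print(repr(delimiter), rows)
```

All three versions parse to the same records, so the delimiter is only a formatting choice. Note that every field comes back as a string, because a plain text file carries no type information.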

### Advantages of flat files include:

- They're text files and therefore human readable.
- Lightweight.
- Simple to understand.
- Software that can read/write text files is ubiquitous, like text editors.
- Great for small datasets.

### Disadvantages of flat files, in comparison to relational databases, for example, include:

- Lack of standards.
- Data redundancy.
- Sharing data can be cumbersome.
- Not great for large datasets (see "When does small become large?" in the Cornell link in More Information).

In [2]:
df = pd.read_csv('bestofrt.tsv', sep='\t')
df.head()

| Unnamed: 0 | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |


---

The two main ways to work with HTML files are:

1. Saving the HTML file to your computer (using the Requests library, for example) and reading that file into a BeautifulSoup constructor
2. Reading the HTML response content directly into a BeautifulSoup constructor (again using the Requests library for example)
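The two approaches above can be sketched side by side. To keep the sketch runnable offline, a small made-up HTML document (`html_bytes`) stands in for the `response.content` you would get from Requests. The temporary folder, the `example.html` file name, and the choice of the built-in `html.parser` are likewise assumptions for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for `response.content`, so no network access is needed.
html_bytes = b"<html><head><title>Example page</title></head><body></body></html>"

# Way 1: save the HTML to a file, then read that file into BeautifulSoup.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.html")  # placeholder file name
with open(path, mode="wb") as file:
    file.write(html_bytes)
with open(path, mode="rb") as file:
    soup_from_file = BeautifulSoup(file, "html.parser")

# Way 2: pass the response content directly to BeautifulSoup.
soup_from_content = BeautifulSoup(html_bytes, "html.parser")

# Both routes produce the same parsed document.
print(soup_from_file.title.string)
print(soup_from_content.title.string)
```

Saving to disk first gives you a local copy to work from if the page later changes or you are offline. Parsing the response directly skips the extra file when you only need the page once.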

In [1]:
import requests

In [3]:
url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
response = requests.get(url)

In [None]:
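# save HTML to file (the file name is a placeholder; choose any name you like)
with open('et_the_extraterrestrial.html', mode='wb') as file:
    # response.content holds the raw bytes of the page, so write in binary mode
    file.write(response.content)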