# 1. Reading data into pandas

Before you can use pandas to analyze some data, you need some data. This might be a file that lives on your computer, a file that lives on the Internet or a collection of data derived from another step in your processing pipeline.

There are several ways you can read data into a pandas dataframe, and you can load many different types of data files, including CSVs and other delimited text files, Excel files [and more](https://www.cbtnuggets.com/blog/2018/10/14-file-types-you-can-import-into-pandas/).

Here are a few of the more common approaches.

First, let's import pandas `as` pd.

In [None]:
import pandas as pd

### From a CSV file

To read a CSV file, pass the path to the file to the `read_csv()` method. If your data file is delimited with something other than a comma, specify the delimiter in the `sep` argument. For example, if you had a pipe-delimited file: `pd.read_csv('../data/my-pipe-delimited-file.txt', sep='|')`
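Here's a small sketch of the `sep` argument in action. The pipe-delimited text and its column names are made up for illustration, and `io.StringIO` is only there so the example runs without a file on disk.

```python
import io

import pandas as pd

# Made-up pipe-delimited text standing in for a file on disk
pipe_text = "name|team|salary\nPlayer A|Team X|1000000\nPlayer B|Team Y|2500000\n"

# io.StringIO lets read_csv() treat the string like an open file
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep='|')

# Three columns came through, so the delimiter was recognized
print(df_pipe.columns.tolist())
print(df_pipe.shape)
```

If you left out `sep='|'` here, pandas would look for commas, find none and read each line into a single column.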

Let's read in the MLB salary data.

In [None]:
df_csv = pd.read_csv('../data/mlb.csv')

In [None]:
df_csv.head()

### From a CSV file on the Internet

Just pass in the URL. This example uses [licensed child care facility data from Colorado's open data portal](https://data.colorado.gov/Early-childhood/Colorado-Licensed-Child-Care-Facilities-Report/a9rr-k8mu).

The values that get returned aren't live: if the data at that URL changes later, your dataframe won't update with the new values. pandas reads in the data once.

In [None]:
df_csv_internet = pd.read_csv('https://data.colorado.gov/api/views/a9rr-k8mu/rows.csv?accessType=DOWNLOAD')

In [None]:
df_csv_internet.head()
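Because the data is only read once, you might want to keep a local snapshot of whatever you downloaded. Here's a minimal sketch of one way to do that with `to_csv()`. The dataframe, its column names and the file name are all made up, and the temporary folder is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Small made-up dataframe standing in for data downloaded from a URL
df_downloaded = pd.DataFrame({
    'facility': ['Facility A', 'Facility B'],
    'city': ['Denver', 'Pueblo'],
})

# Hypothetical snapshot location; in practice you'd pick a path in your project
snapshot_path = os.path.join(tempfile.mkdtemp(), 'child-care-snapshot.csv')

# index=False leaves the row numbers out of the saved file
df_downloaded.to_csv(snapshot_path, index=False)

# Later, read the saved copy instead of hitting the URL again
df_snapshot = pd.read_csv(snapshot_path)
print(df_snapshot.shape)
```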

### From an Excel file

To read an Excel file in pandas, use the [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html) method. Depending on the filetype, you'll also need to separately install a helper module into your virtual environment: `xlrd` for `xls` files or `openpyxl` for `xlsx` files. (We've already installed both here.)

You might also want to specify the `sheet_name` to select your worksheet of interest -- the default is "the first one."

Here, we're reading in a spreadsheet with data on accidental drug overdoses in Connecticut.

In [None]:
df_xl = pd.read_excel('../data/CT_Overdoses_2012-2016.xlsx', sheet_name='Accidental_Drug_Related_Deaths_')

In [None]:
df_xl.head()

### From a Python data collection

Maybe the work you're doing in pandas happens downstream of some other Python processing, so the data exists as a native Python data collection -- say, a list of dictionaries. You can turn this (and other Python data collections, like a list of lists) into a pandas dataframe, too.

In [None]:
test_data = [
    {'name': 'Cody Winchester', 'job': 'Director of technology', 'location': 'Spearfish, SD'},
    {'name': 'Guy Fieri', 'job': 'Gourmand', 'location': 'Flavortown'},
    {'name': 'Michael Bennet', 'job': 'Senator', 'location': 'Washington, D.C.'}
]

In [None]:
df_py_lod = pd.DataFrame(data=test_data)

In [None]:
df_py_lod.head()

If you have a list of lists, you also need to specify the `columns` keyword argument:

In [None]:
test_data_ls = [
    ['Cody Winchester', 'Director of technology', 'Spearfish, SD'],
    ['Guy Fieri', 'Gourmand', 'Flavortown'],
    ['Michael Bennet', 'Senator', 'Washington, D.C.']
]

In [None]:
df_py_lol = pd.DataFrame(data=test_data_ls, columns=['name', 'job', 'location'])

In [None]:
df_py_lol.head()
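Another collection you might run into is a dictionary of lists, where each key is a column name and each list holds that column's values. This sketch reuses the same test records; the variable names are just for illustration.

```python
import pandas as pd

# Same records, organized by column instead of by row
test_data_dol = {
    'name': ['Cody Winchester', 'Guy Fieri', 'Michael Bennet'],
    'job': ['Director of technology', 'Gourmand', 'Senator'],
    'location': ['Spearfish, SD', 'Flavortown', 'Washington, D.C.'],
}

# The dictionary keys become the column names, so no columns argument is needed
df_py_dol = pd.DataFrame(data=test_data_dol)

print(df_py_dol.head())
```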

### From an HTML table

OK SO.

This one requires you to install and specify the Python package that has the HTML parsing engine of your choice -- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [lxml](http://lxml.de/). The default is `lxml`, but here we're going to use BeautifulSoup.

Huge caveat! Pulling data directly from an HTML table can be hit and miss, depending on how hairy the underlying HTML is. And if you want to scrape data from a website, it's usually better practice to save the results to a local file, _then_ load it up for analysis. But it's good to know that it's an option.

In this example, we've installed `BeautifulSoup` (alias `bs4`) and we're going to import [a table of lead burn instructors](https://www.texasagriculture.gov/Portals/0/Reports/PIR/certified_lead_burn_instructors.html) from the Texas Department of Agriculture website.

We're going to pass three things to [the pandas `read_html()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html):
1. The URL we want to scrape (in quotes, as a string)
2. The `flavor` of parser that we'd like to use to process the HTML (`bs4`)
3. The number of the row to use as the `header`, which supplies the column names (usually it's 0 -- the first one)

Reading through the documentation for this method, we also notice that it returns a _list_ of matching tables as dataframes, even when only one table matches. Our arguments were specific enough that there's only one item in the returned list, so we can grab it with `[0]`.

In [None]:
html_df = pd.read_html('https://www.texasagriculture.gov/Portals/0/Reports/PIR/certified_lead_burn_instructors.html',
                       flavor='bs4',
                       header=0)[0]

In [None]:
html_df.head()

### From a folder of identically formatted CSVs

Sometimes, rather than one file you need to load, you have a directory of files with the same formatting but different data. Let's talk about a strategy for reading them all into a single dataframe -- the data for this exercise comes from [this wonderful data-driven story from 2019 by C.K. Hickey in _Foreign Policy_](https://foreignpolicy.com/all-the-presidents-meals-state-dinners-white-house-infographic/) on state dinner menus for U.S. presidents (thank you, C.K.!) and can be found in the `../data/state-dinners/` directory.

Our strategy:
- Get a list of these files using [the `glob` module](https://docs.python.org/3/library/glob.html) from the standard library
- Use a fun bit of Python syntax called a ["list comprehension"](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) in conjunction with the pandas methods `read_csv()` (which we've seen before) and [`concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) (which we have not)

First, we need to import `glob` before we can use it. (n.b.: The customary thing to do is drop all your imports at the top of your script.)

In [None]:
import glob

Get a list of the files using wildcards:

In [None]:
sd_files = glob.glob('../data/state-dinners/*.csv')

In [None]:
sd_files

In human language: Go to the `glob` module we just imported and use its `glob()` function to get a list of files based on the path and filename wildcards we hand it.
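To see the wildcard matching in isolation, here's a sketch that builds a throwaway folder and globs it. The folder and file names are made up. The `sorted()` call is an optional extra: `glob` doesn't promise any particular order, so sorting makes the result predictable.

```python
import glob
import os
import tempfile

# Throwaway folder with two CSVs and one file that should NOT match
demo_dir = tempfile.mkdtemp()
for filename in ['1990.csv', '2000.csv', 'notes.txt']:
    with open(os.path.join(demo_dir, filename), 'w') as f:
        f.write('placeholder\n')

# *.csv matches any filename ending in .csv inside demo_dir
csv_paths = sorted(glob.glob(os.path.join(demo_dir, '*.csv')))

# Show just the filenames, without the temp folder path
print([os.path.basename(p) for p in csv_paths])
```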

Now let's talk for a sec about **list comprehensions**. Let's say you had a list of items that you wanted to _do_ something to -- some math, some filtering, some reading into dataframes, whatever. One of the main uses for list comprehensions is efficiently "saving" the results of this operation to a new variable.

Here's a simple example -- let's say we had the following list of numbers:

In [None]:
number_list = [1, 2, 3, 4, 5, 6]

... and we want to end up with a list of numbers that is each of these numbers multiplied by 10. We could do something like this:

In [None]:
new_list = []
for x in number_list:
    new_list.append(x*10)

In [None]:
new_list

You could achieve the same thing more concisely with a _list comprehension_:

In [None]:
new_list_lc = [x*10 for x in number_list]

In [None]:
new_list_lc

Here, `x` is a placeholder for each item in the list, same as the variable defined in the `for` loop.
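List comprehensions can also filter as they go. As a quick sketch, adding an `if` clause at the end keeps only the items that pass the test (the variable name `even_times_ten` is just for illustration):

```python
number_list = [1, 2, 3, 4, 5, 6]

# Keep only the even numbers, then multiply each one by 10
even_times_ten = [x*10 for x in number_list if x % 2 == 0]

print(even_times_ten)  # [20, 40, 60]
```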

That's basically what we're going to do here -- instead of creating an empty list, looping over each file in the `state-dinners` directory, creating a new dataframe, adding it to the list, then concatenating all those dataframes, we can do it all in one fell swoop:

In [None]:
df_dinners = pd.concat([pd.read_csv(x) for x in sd_files])

Reading this from the inside out as a human sentence: Take each CSV file in the `state-dinners` directory, which we found earlier using the `glob` tool, and read it into a (more or less temporary) dataframe -- then take all of those dataframes and concatenate them together into one dataframe.

In [None]:
df_dinners.head()
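Two optional refinements are worth knowing about. By default, `concat()` keeps each file's original row labels, so the index repeats (0, 1, 2, 0, 1 ...), and nothing records which file a row came from. Here's a sketch that handles both, using made-up files in a throwaway folder. The file names, the `course` column and the `source_file` column are assumptions for illustration, not part of the state dinner data.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny made-up CSVs with identical columns
demo_dir = tempfile.mkdtemp()
pd.DataFrame({'course': ['soup', 'salad']}).to_csv(
    os.path.join(demo_dir, 'dinner-a.csv'), index=False)
pd.DataFrame({'course': ['dessert']}).to_csv(
    os.path.join(demo_dir, 'dinner-b.csv'), index=False)

demo_files = sorted(glob.glob(os.path.join(demo_dir, '*.csv')))

# assign() tags each row with the file it came from;
# ignore_index=True renumbers the rows 0..n-1 in the combined dataframe
df_demo = pd.concat(
    [pd.read_csv(x).assign(source_file=os.path.basename(x)) for x in demo_files],
    ignore_index=True
)

print(df_demo)
```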

## Inspecting your data

Once you have your data in a data frame, your first order of business is to answer some basic questions about the data itself -- things you might want to put in your data diary, such as:
- What's the shape of the data? (How many rows, how many columns?)
- How many blank/null values are there in each column?
- Did each column import as the correct type of data? (Text, number, etc.)
- Are there any duplicate rows?
- What are the most common values in each column? (The Golden Query™️)
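Two of these questions, the null counts and the duplicate rows, can be sketched on a small made-up dataframe. The column names and values here are invented for illustration.

```python
import pandas as pd

# Made-up data with one missing value per column and one repeated row
df_check = pd.DataFrame({
    'city': ['Denver', 'Denver', None, 'Pueblo'],
    'capacity': [20, 20, 15, None],
})

# isna() flags missing values; sum() counts them column by column
null_counts = df_check.isna().sum()

# duplicated() flags rows that repeat an earlier row exactly
duplicate_count = df_check.duplicated().sum()

print(null_counts)
print(duplicate_count)
```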

Let's take the Colorado child care data as an example (we read this in as `df_csv_internet` earlier).

In [None]:
# take a quick look
df_csv_internet.head()

In [None]:
# or look at the last records
df_csv_internet.tail()

In [None]:
# check the column names
df_csv_internet.columns

In [None]:
# check the data types
df_csv_internet.dtypes

In [None]:
# how many rows, how many columns?
df_csv_internet.shape

In [None]:
# access each of the numbers in the .shape attribute
no_rows = df_csv_internet.shape[0]
no_cols = df_csv_internet.shape[1]

In [None]:
print(no_rows)

In [None]:
print(no_cols)

In [None]:
# alternatively, use len() to check row count
len(df_csv_internet)

In [None]:
# check basic stats of numeric columns
df_csv_internet.describe()

In [None]:
# run .value_counts() against individual columns to grab most common values
df_csv_internet.CITY.value_counts()
# df_csv_internet.STATE.value_counts()
# etc.

... and so on. Another good integrity check is looking at the ranges for numeric/date columns to see if they make sense. Let's try that!

In [None]:
df_csv_internet['EXPIRATION DATE'].max()

Oops! The dates in this column were imported as `object`, which (loosely speaking) is a data type roughly analogous to plain text. So, in order to figure out the date range -- or to do anything with these dates -- we need to convert these values into dates.
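Why does this matter? When dates are stored as text, `.max()` compares them character by character, not chronologically. Here's a quick sketch with two made-up date strings, assuming a month/day/year format:

```python
import pandas as pd

# Two made-up dates stored as plain text
date_strings = pd.Series(['12/31/2019', '01/15/2024'])

# Text comparison: '1' sorts after '0', so the 2019 string "wins"
text_max = date_strings.max()

# Real dates: the 2024 date is correctly the latest
date_max = pd.to_datetime(date_strings, format='%m/%d/%Y').max()

print(text_max)
print(date_max)
```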

We could either go back up to where we imported this data (using the [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) method) and add a `parse_dates` argument, or we can convert the values in our current data frame using the [`to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) method.

⚠️ If you didn't know how to do this already, what would be your Google?

In [None]:
df_csv_internet['EXPIRATION DATE'] = pd.to_datetime(
    df_csv_internet['EXPIRATION DATE'],
    errors='coerce'  # values that can't be parsed become NaT instead of raising an error
)

In [None]:
df_csv_internet.dtypes
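A note on `errors='coerce'`: it tells pandas to turn any value it can't parse into `NaT` (the datetime version of a null) rather than stopping with an error. That's convenient, but it's worth counting how many values got coerced. A minimal sketch with made-up values, assuming a month/day/year format:

```python
import pandas as pd

# Made-up column with one value that isn't a date
raw_dates = pd.Series(['01/15/2024', 'not a date', '03/01/2025'])

# The bad value becomes NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

# Count how many values could not be parsed
coerced_count = parsed.isna().sum()

print(parsed)
print(coerced_count)
```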

In [None]:
# latest date
df_csv_internet['EXPIRATION DATE'].max()

In [None]:
# earliest date
df_csv_internet['EXPIRATION DATE'].min()
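For comparison, here's a sketch of the other option mentioned earlier: handing `read_csv()` a `parse_dates` argument so the column arrives as dates in the first place. The CSV text is made up (only the `EXPIRATION DATE` column name comes from the child care data), and `io.StringIO` stands in for a real file or URL.

```python
import io

import pandas as pd

# Made-up CSV text standing in for the real file
csv_text = "FACILITY,EXPIRATION DATE\nFacility A,01/15/2024\nFacility B,12/31/2019\n"

# parse_dates takes a list of the columns to convert while reading
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['EXPIRATION DATE'])

# The column is now a datetime type, so max() is chronological
print(df_dates.dtypes)
print(df_dates['EXPIRATION DATE'].max())
```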