# Pandas and some Census data

Based on teaching materials from Lam Thuy Vo at NICAR19, this notebook explores a U.S. Census CSV file using Pandas.

* [Slides](https://docs.google.com/presentation/d/1ZG-IC33qL6dOk-WfwMuiyfzLb-Th1I_XoGNFPpbq8Ps/)
* [GitHub repo](https://github.com/lamthuyvo/python-data-nicar2019)
* [Pandas user guide](http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)

In [None]:
# import the pandas library and give it a "nickname," pd, to be used when you call a pandas function
import pandas as pd

In [None]:
# import a CSV file's data as a Pandas dataframe
# we assign the dataframe to the variable census_data, but some people would prefer to name it df 
census_data = pd.read_csv('2016_census_data.csv')

In [None]:
# look at first 5 rows of dataset with a function called .head()
# be sure to scroll rightwards to see more columns
census_data.head(5)

In [None]:
# look at last 5 rows of dataset with a function called .tail()
# note the row numbers in the first column
census_data.tail(5)

In [None]:
# view 5 random rows - the number can be more or less than 5, up to you
# if you shift-enter in this cell more than once, you'll get different rows each time
census_data.sample(5)

What we are looking at is called a **dataframe.** What we are doing is **exploring** the dataframe to get an idea of how much data we have and what it looks like. We can also read the column headings to understand what data we have.

In [None]:
# we can transpose (flip) the data so that the column headings are in column 1 and all the rows become columns
# it does not stay this way - .T only returns a flipped version for us to look at
census_data.T

In [None]:
# see a list of all column headings 
census_data.columns

In [None]:
# see all the data types in your dataframe 
census_data.dtypes

`object` is used for text (string), and `int64` means the data in that column is an integer. Sometimes the data is not formatted correctly, and you need to change the data type in a column. Other data types include `float64`, `bool`, and `datetime64[ns]`.
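
As a minimal sketch of changing a column's data type, the cell below uses a small made-up dataframe (the `toy` name and its values are assumptions for illustration, not part of the Census file). It converts a column of numbers stored as text into integers with `.astype()`.

```python
import pandas as pd

# hypothetical toy data - NOT taken from the Census CSV
toy = pd.DataFrame({
    "tract": ["A", "B", "C"],
    "population": ["1200", "850", "2300"],  # numbers stored as text
})

# before the change, population is a text column
print(toy.dtypes)

# convert the text column to integers
toy["population"] = toy["population"].astype("int64")

# after the change, population is int64, so math works on it
print(toy.dtypes)
print(toy["population"].sum())
```

If a column contains values that cannot be read as numbers, `.astype()` raises an error, which is a useful hint that the data needs cleaning first.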

In [None]:
# use normal Python len() to see how many rows, how many columns in the dataframe
print(len(census_data))
print(len(census_data.columns))

In [None]:
# or use a pandas command, shape, to get that info about the dataframe
census_data.shape

In [None]:
# see first five cells in 'county' column
census_data['county'].head(5)

In [None]:
# see ALL cells in 'county' column - but not really all - pandas shortens long output to the first and last rows
census_data['county']

In [None]:
# create a Python list with the names of only the columns you want to view
column_names = ['county', 'total_population', 'median_income', 'educational_attainment']
# now view only those columns - use .sample() to get random rows
census_data[column_names].sample(10)

In [None]:
# pandas assigns each row a number as an index automatically
# we could assign each row a different index, but not necessary
# if I know I want to look at only the data in the row with index 350 - 
census_data.iloc[350]

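`iloc` stands for "integer-location based indexing": it selects rows by their integer position, counting from 0, and it takes square brackets rather than parentheses. With the default index that pandas assigns, a row's position and its index number are the same. [Learn more here.](https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/)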
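
To see what "integer-location" means, here is a small sketch with a made-up dataframe whose index labels are 10, 20, and 30 (the `toy` dataframe is an assumption for illustration only). It compares `.iloc[]`, which counts positions, with `.loc[]`, which looks up index labels.

```python
import pandas as pd

# hypothetical toy data with custom index labels instead of 0, 1, 2
toy = pd.DataFrame(
    {"county": ["Pike County", "Passaic County", "Essex County"]},
    index=[10, 20, 30],
)

# .iloc[] counts positions: 0 is the first row, whatever its label is
by_position = toy.iloc[0]

# .loc[] looks up the index label: 10 is the label of the first row
by_label = toy.loc[10]

print(by_position["county"])
print(by_label["county"])
```

Both lines print the same county here, because position 0 and label 10 point at the same row.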

In [None]:
# combining .iloc[] and column-name search
census_data['name'].iloc[350]

In [None]:
# or using that list from before and a different index number
census_data[column_names].iloc[4]

In [None]:
# do math things on ONE column only
# note - if we don't PRINT these, we'll see only the last one 
print(census_data['black_alone'].mean())
print(census_data['black_alone'].median())
print(census_data['black_alone'].sum())

What we got from the previous cell was the mean, median, and sum (total) of ALL values in the entire column named "black_alone" - for all 4,700 rows! 

In [None]:
# get summary statistics for ONE column with .describe()
census_data['black_alone'].describe()

The meaning of each line in the previous result is explained [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html). `count` is how many non-missing values were counted (4,700). `mean` is the average, and it matches the mean we printed in the earlier cell. `std` is the [standard deviation](https://www.robertniles.com/stats/stdev.shtml). `min` is the lowest value in the column - here, 0 means that some Census tracts have zero black residents. The three percentiles (25%, 50%, and 75%) each give the value at or below which that share of the rows falls - so 25% of the Census tracts have 47 or fewer black residents, and the 50% line is the median. Finally, `max` tells us the highest value in this column.
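
A quick way to convince yourself what the percentile lines mean is to compare them with `.median()` and `.quantile()`. This sketch uses five made-up numbers (the `values` Series is an assumption for illustration, not Census data).

```python
import pandas as pd

# hypothetical values - NOT from the Census CSV
values = pd.Series([0, 10, 20, 30, 40])

summary = values.describe()

# the 50% line is the same number as the median
print(summary["50%"], values.median())

# the 25% line is the same number as the 0.25 quantile
print(summary["25%"], values.quantile(0.25))
```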

In [None]:
# we can assign an entire column to a variable, and then use that variable to make comparisons
asian_population = census_data['asian']
asian_population.describe()

In [None]:
# now we assign a different column to a different variable
black_population = census_data['black_alone']

In [None]:
# compare the means of the two columns
# python greater than
black_population.mean() > asian_population.mean()

In [None]:
# reverse
asian_population.mean() > black_population.mean()

In [None]:
# show unique values in a given column
census_data.county.unique()

Remember that each of these counties has MANY Census tracts, so the county names appear many times in the `county` column. The previous cell lets us see which counties are in the dataframe, with no duplications.
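
If you want the number of distinct counties rather than their names, one option is `.nunique()`. The sketch below uses a made-up `toy` dataframe (an assumption for illustration) with repeated county names.

```python
import pandas as pd

# hypothetical toy data - county names repeat, like tracts within a county
toy = pd.DataFrame({
    "county": ["Pike County", "Pike County", "Essex County",
               "Passaic County", "Essex County"],
})

# the distinct names, each listed once
names = toy["county"].unique()

# how many distinct names there are
how_many = toy["county"].nunique()

print(names)
print(how_many)
```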

In [None]:
# sort them alphabetically and print one per line 
for county in sorted(census_data.county.unique()):
    print(county)

In [None]:
# find rows where a string matches the value in given column
# we want to find out how many rows are for Pike County
pike = census_data.loc[(census_data['county'] == 'Pike County')]
# count how many Pike County rows
len(pike)

You can see from previous results that there are definitely cells containing the string "Pike County" - so why does `len()` come back with 0? It means that there is no match, and that means there must be some **invisible characters** in the string - such as spaces, line endings, or tabs. Python has a method for stripping those characters off the start or end of a string.
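
One way to make invisible characters visible is Python's built-in `repr()`, which prints the quotes around a string and shows escape codes such as `\n`. This is a sketch on made-up values (the `toy` dataframe is an assumption, not the real Census column).

```python
import pandas as pd

# hypothetical toy data with stray spaces and a line ending
toy = pd.DataFrame({
    "county": ["Pike County ", " Pike County", "Essex County\n"],
})

# repr() shows the quotes, so extra spaces and \n become visible
for value in toy["county"]:
    print(repr(value))

# an exact comparison finds nothing because of the extra characters
exact_matches = (toy["county"] == "Pike County").sum()

# after stripping whitespace from both ends, the comparison works
stripped_matches = (toy["county"].str.strip() == "Pike County").sum()

print(exact_matches, stripped_matches)
```

Called with no arguments, `.str.strip()` removes any whitespace from both ends of each string.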

In [None]:
# data in this column is dirty, so we strip spaces and invisible characters with .str.strip(' \t\n\r') 
pike = census_data.loc[(census_data['county'].str.strip(' \t\n\r') == 'Pike County')]
# count how many Pike County rows
len(pike)

That's better. And now that we realize the county cells contain "dirty data," we might want to simply clean all of them up at once. It's best to preserve the original column (for safety) and create a new column that has the clean data in it.

In [None]:
# to create a new column with clean county names - 
# 'county_clean' is the NEW column 
census_data['county_clean'] = census_data['county'].str.strip(' \t\n\r')

In [None]:
# create a Python list with just the columns you want
column_names  = ['name', 'county', 'state', 'total_population', 'median_income', 'median_home_value']

# using the pike variable from a previous cell - 
# sort the rows with highest median_home_value at top, lowest at bottom
pike[column_names].sort_values('median_home_value', ascending=False)

In the output above, notice the **total population** for the tract with highest median home value.

In [None]:
# if I want to use my new 'county_clean' column instead -
column_names  = ['county_clean', 'total_population', 'median_income', 'median_home_value', 'name']

# note, I changed the column order in that list!! 
passaic = census_data.loc[(census_data['county_clean'] == 'Passaic County')]
passaic[column_names].sort_values('median_home_value', ascending=False)

**Note** how you can display the columns in ANY ORDER you desire. Just make a new list for `column_names` and use that to show the data. Include only the columns you want to see, in the order you want.
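
Here is a small sketch of the same idea on made-up data (the `toy` dataframe and its values are assumptions for illustration). It shows that picking columns with a list gives you a new view in your chosen order, while the original dataframe keeps its own column order.

```python
import pandas as pd

# hypothetical toy data - NOT from the Census CSV
toy = pd.DataFrame({
    "name": ["Tract 1", "Tract 2"],
    "county": ["Pike County", "Essex County"],
    "median_income": [52000, 61000],
})

# pick two columns, in a different order than the original
wanted = ["median_income", "name"]
view = toy[wanted]

print(list(view.columns))  # the order we asked for
print(list(toy.columns))   # the original order, unchanged
```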

But wait, there's more! What if you want to save that data in that exact format - maybe to send it to someone else?

In [None]:
# save that to a NEW CSV file with a new filename
new_dataset = passaic[column_names].sort_values('median_home_value', ascending=False)
new_dataset.to_csv('passaic_only.csv', encoding='utf8')

THAT is seriously powerful. You just extracted **100 rows** from a 4,700-row CSV, threw out 10 of the 16 columns, and put the columns into a different order. Your original CSV remains untouched and intact. You created a new CSV file that you could share with others who do not have Jupyter Notebooks.
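
To check what actually lands in a saved file, you can read it straight back in. This sketch uses a made-up dataframe and a temporary folder (both are assumptions for illustration). It also passes `index=False`, an optional choice that leaves the row numbers out of the file.

```python
import os
import tempfile

import pandas as pd

# hypothetical toy data - NOT from the Census CSV
toy = pd.DataFrame({
    "county_clean": ["Passaic County", "Passaic County"],
    "median_home_value": [250000, 410000],
})

ordered = toy.sort_values("median_home_value", ascending=False)

# write to a temporary folder so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "toy_only.csv")
    ordered.to_csv(out_path, index=False)  # index=False skips the row numbers
    check = pd.read_csv(out_path)          # read the file back in

print(check)
```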

In case you're not sure what directory the new file was saved to, enter `pwd` to find out which directory this Jupyter Notebook is running in. (`pwd` is a command that stands for "print working directory.")

In [None]:
pwd

The output from the previous cell shows you where to find the file *passaic_only.csv* on your computer.

In [None]:
# how many rows (Census tracts) does each county have, anyway? 
census_data['county_clean'].value_counts()

**Note** that the list (in the output from the previous cell) is in order from most rows to fewest rows. You can see which counties have a large number of Census tracts in this dataset (which might not be complete for some counties).
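
If you would rather see the counts in alphabetical order by county name, one option is to follow `.value_counts()` with `.sort_index()`. The sketch below uses a made-up `toy` dataframe (an assumption for illustration, not the real data).

```python
import pandas as pd

# hypothetical toy data - NOT from the Census CSV
toy = pd.DataFrame({
    "county_clean": ["Essex County", "Pike County", "Essex County",
                     "Essex County", "Pike County", "Passaic County"],
})

# default order - most rows first
counts = toy["county_clean"].value_counts()
print(counts)

# same counts, sorted alphabetically by county name
alphabetical = counts.sort_index()
print(alphabetical)
```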