# Pandas

By the end of this lecture, we'll be ready to import datasets from the Internet for analysis.

The pandas module provides all kinds of functionality for dealing with tables of data.  Python incorporates ideas from many places in its design, but Pandas was clearly inspired by the R programming language.



One big idea that made it into Pandas is that columns and rows in datasets want to be labeled; a plain array that just has the data but no labels is error-prone and harder to understand at a glance.


In [None]:
import pandas as pd
import numpy as np

# Hotel ratings on a 5-star scale
my_data = np.array([[5, 5, 4], [2, 3, 4]])
print(my_data)
# Contrast with the labeled Pandas version
df2 = pd.DataFrame(my_data, columns = ["Hilton", "Marriott", "Four Seasons"])
df2.set_index([pd.Index(["Alice rating", "Bob rating"])])


DataFrames also have functionality for dealing with more messy data, with missing values and different data types mixed in the same overall "package" of data.  Numpy arrays, by contrast, expect all values to be present and of the same type.
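As a minimal sketch of that contrast (the guest names and ratings here are made up for illustration), a DataFrame can keep a text column next to a numeric column that has a hole in it, while a numpy array built from mixed values converts everything to a single type.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: Bob never rated the Hilton, so that cell is missing
ratings = pd.DataFrame({
    "Guest": ["Alice", "Bob"],
    "Hilton": [5, np.nan],
})
print(ratings)
print(ratings["Hilton"].isna().sum())  # number of missing Hilton ratings

# A numpy array given the same kind of mix stores every value as text
mixed = np.array(["Alice", 5])
print(mixed.dtype.kind)  # 'U' is numpy's code for unicode text
```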

# .tsv and .csv files

The data that we'll import to DataFrames will very often be .tsv files - "Tab Separated Value" files - or .csv files - "Comma Separated Value."

.tsv and .csv are easily readable as text, as almost the whole format is described in the name.  .csv files contain the values in the table separated by commas (or newline to go to a new row), while .tsv files contain the data values separated by tabs (or newline to go to a new row).  A simple format makes it easy for anyone to read or edit the data.

Examples of what each file type looks like if you open it (CSV then TSV):

In [None]:
5,5,4
2,3,4

In [None]:
5    5    4
2    3    4
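To see how little separates the two formats, here is a small sketch that parses both examples from in-memory strings instead of files.  The use of io.StringIO and header=None is our own choice for the illustration (the sample text has no header row); the only real difference between the two calls is the sep argument.

```python
import io
import pandas as pd

csv_text = "5,5,4\n2,3,4\n"
tsv_text = "5\t5\t4\n2\t3\t4\n"  # \t is the tab character

# read_csv handles both formats; sep tells it which separator to expect
from_csv = pd.read_csv(io.StringIO(csv_text), header=None)
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=None)

print(from_csv)
print(from_csv.equals(from_tsv))  # both files describe the same table
```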

# Importing the data as a dataframe

The fundamental object that you interact with in Pandas is a DataFrame, which you can think of as a highly annotated array with mixed data types.  We'll first load a file into a dataframe.

If you're running in Google Colab, you'll first need to upload the file to Colab, which the next cell does.  You can skip this cell if you're working locally with Jupyter Notebook.

In [None]:
# Skip this cell if not working in Google Colab
from google.colab import files

uploaded = files.upload() # pick starbucks_drinkMenu_expanded.csv

The above code should create a menu where you can upload a file into Google Colab's space.

Once you upload the file in that menu, you should be able to see the file with the system command ls, which lists files in a directory.  We put a ! before the command to indicate that we're using a system command instead of Python.

In [None]:
!ls

Once that's done, or if you're working locally and have the necessary file in your current directory (the one where you launched Jupyter notebook), you can read the CSV file into a dataframe as follows.  (The head() method displays only the first few rows of a DataFrame, making it helpful for previewing files to see whether you have the data you expect.)

In [None]:
import pandas as pd
df = pd.read_csv('starbucks_drinkMenu_expanded.csv', index_col = 'Beverage')
df.head()

One argument to read_csv that we haven't explained is index_col.  This determines which column supplies the row labels (the index) that we use to look up rows.  It makes the most sense to choose the column with the most specific "names" for the rows, which is Beverage in this example.  The loc property lets us look up values by index and column, even if the index doesn't uniquely identify a row; in that case, it returns every matching row.  (Notice that it uses square brackets instead of parentheses, which is why we call it a "property" rather than a "method.")

In [None]:
df.loc['Brewed Coffee','Calories'] # will retrieve the different sizes' calories
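The repeated-label behavior is easier to see on a tiny table.  This is only a sketch: the sizes and calorie numbers below are invented rather than taken from the Starbucks file, and we set the index with set_index() instead of index_col because there is no file to read.

```python
import pandas as pd

# Invented menu with a repeated index label
menu = pd.DataFrame({
    "Beverage": ["Brewed Coffee", "Brewed Coffee", "Caffe Latte"],
    "Size": ["Short", "Tall", "Short"],
    "Calories": [3, 4, 100],
}).set_index("Beverage")

# Two rows share the label, so loc returns a Series with both values
coffee_calories = menu.loc["Brewed Coffee", "Calories"]
print(coffee_calories)
print(len(coffee_calories))
```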

But we could have chosen a different column to be the index.

In [None]:
df2 = pd.read_csv('starbucks_drinkMenu_expanded.csv', index_col = 'Beverage_category')
df2.head()

In [None]:
df2.loc['Coffee', 'Calories']

# Constructors

We can get a better appreciation of the structure of these tables if we look at a constructor.  A numpy array can provide much of the data, but it has neither column names nor an index column.  If we supply both, the data is easier to read.  We can supply column names in the constructor and provide index values immediately after with set_index().

In [None]:
import numpy as np

# Hotel ratings on a 5-star scale
my_data = np.array([[5, 5, 4], [2, 3, 4]])
df2 = pd.DataFrame(my_data, columns = ["Hilton", "Marriott", "Four Seasons"])
df2 = df2.set_index([pd.Index(["Alice", "Bob"])])
df2
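As an alternative sketch, the DataFrame constructor also accepts an index argument, so the row labels can be supplied in the same call as the column names.  The variable name ratings is ours; the data is the same hotel example.

```python
import numpy as np
import pandas as pd

# Hotel ratings on a 5-star scale
my_data = np.array([[5, 5, 4], [2, 3, 4]])

# Column names and row labels supplied together
ratings = pd.DataFrame(
    my_data,
    columns=["Hilton", "Marriott", "Four Seasons"],
    index=["Alice", "Bob"],
)
print(ratings)
print(ratings.loc["Bob", "Marriott"])
```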

Strictly speaking, we don't need the column names or the index for our DataFrame.  We can create a DataFrame from a numpy array that doesn't have either column or index names.  We can access elements using the iloc property, which relies on numbers instead of strings.

In [None]:
df3 = pd.DataFrame(my_data) # plain numpy array as input
df3.head()

In [None]:
df3.iloc[1,0]

A plain numpy array wouldn't work to build a table of mixed types, like having strings and integers both in the table, so DataFrames also have a constructor that takes a dictionary where the keys are column names and the values are columns, given as lists of equal length.

In [None]:
my_data_dict = {
    "Gender" : ["F", "M"],
    "Hilton" : [5, 2],
    "Marriott": [5, 3],
    "Four Seasons": [4,4]
}
df4 = pd.DataFrame(my_data_dict)
df4 = df4.set_index([pd.Index(["Alice", "Bob"])])
df4

# Slicing the table

Suppose we want to iterate over all the items in a column or row.

It's easy to get a whole column to iterate over - just write tablename[columnname], as if the DataFrame were a dictionary and the column name a key.  When a column is pulled out of the table like this (the table itself is left unchanged), it comes back as a Series object.  Series can be iterated over using for loops; they're like annotated lists, or one-dimensional DataFrames.

In [None]:
df4['Hilton']

In [None]:
total = 0 # named 'total' so we don't shadow Python's built-in sum()
for rating in df4['Hilton']:
    total += rating
print('Average Hilton Rating: ' + str(total / len(df4['Hilton'])))

To get sections of the table that aren't simply columns, we can use the loc and iloc properties mentioned previously, which access values by the names of the rows and columns (loc) or by the numbers of the rows and columns (iloc).  The syntax is df.loc[row, col] or df.iloc[row, col].  A colon (:) in either position means "give me all of these," and the result is a Series if it's one-dimensional.

In [None]:
df2

In [None]:
df2.loc[:, "Marriott"]

In [None]:
df2.loc["Bob",:]

In [None]:
df2

In [None]:
df2.iloc[:, 2]

In [None]:
df2.iloc[0,:]

In [None]:
df.loc["Brewed Coffee",:]

In [None]:
df.loc[:,'Beverage_prep']
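One detail worth a quick sketch is how ranges behave: a loc slice includes its end label, while an iloc slice stops before its end number, just like a list slice.  The third rater, Carol, is hypothetical and added only so that there is something to leave out.

```python
import pandas as pd

ratings = pd.DataFrame(
    {"Hilton": [5, 2, 4], "Marriott": [5, 3, 3], "Four Seasons": [4, 4, 5]},
    index=["Alice", "Bob", "Carol"],
)

# Label slices include both endpoints
by_label = ratings.loc["Alice":"Bob", "Hilton":"Marriott"]

# Position slices exclude the end, so 0:2 means positions 0 and 1
by_position = ratings.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```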

Iterating over one of these slice results with a for loop works the way you would expect.

In [None]:
# Iterating over Series returned by indexing
biggest_c = 0
calorie_list = df.loc["Brewed Coffee","Calories"] # 4 items
for c in calorie_list:
  if c > biggest_c:
    biggest_c = c
print(biggest_c)

Series objects can often be treated as numpy arrays and behave in the way you'd expect.  Here we show an arbitrary example of this - we can multiply a whole Series by 2 without a loop.


In [None]:
df.loc['Brewed Coffee', 'Calories'] * 2

# Exercise

Write code that finds the highest amount of sugar (Sugars_g) among all the products in the Starbucks table.  You don't have to name the item.

In [None]:
# TODO

# Indexing with conditions

We can also index into the dataframe using expressions that evaluate to true or false for each row.  The inner expression, "df['Calories'] > 300", creates a Series full of True or False values, one per row.  (A numpy array would do something similar.)  Passing this to df again results in a smaller table with just the rows that evaluated to True.

In [None]:
(df['Calories'] > 300)

In [None]:
df[df['Calories'] > 300].head()

It's also possible to link together multiple criteria, in order to ask for all soymilk beverages that have more than 300 calories, for example.  The one catch is that you can't use the normal boolean operators and, or, and not for this - that raises a ValueError, because those operators are for the single boolean values we were generating earlier in the course, not Series full of boolean values that need to be compared elementwise.  The operators &, |, and ~ act as and, or, and not in this context, respectively.  Each comparison needs its own parentheses, because & and | bind more tightly than > and ==.

In [None]:
df[(df['Calories'] > 300) & (df['Beverage_prep'] == 'Soymilk')]
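Here is a small sketch of all three operators, plus the error that the ordinary and keyword produces.  The three drinks and their numbers are invented so that the cell runs without the Starbucks file.

```python
import pandas as pd

# Invented drinks, reusing two column names from the Starbucks table
drinks = pd.DataFrame(
    {"Calories": [350, 120, 400],
     "Beverage_prep": ["Soymilk", "Soymilk", "Whole Milk"]},
    index=["Drink A", "Drink B", "Drink C"],
)

high = drinks["Calories"] > 300
soy = drinks["Beverage_prep"] == "Soymilk"

both = drinks[high & soy]    # high-calorie AND soymilk
either = drinks[high | soy]  # high-calorie OR soymilk
not_soy = drinks[~soy]       # NOT soymilk

# The plain keyword fails because a whole Series has no single truth value
try:
    drinks[high and soy]
except ValueError as err:
    message = str(err)
    print(message)
```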

# Exercise

Try creating a DataFrame that contains all beverages with no trans fat ("Trans_Fat_g") and more than 2 grams of protein ("Protein_g")

In [None]:
# TODO

# Convenient mathematical functions

The Series objects that make up the columns of a DataFrame have a variety of convenient functions.

For example, there's an existing implementation of max() that would have made our previous efforts more concise.  And while max() finds the maximum value, idxmax() finds the index of that value - the name of the item instead of the value.

In [None]:
print(df.loc[:, "Protein_g"].mean())
print(df.loc[:, "Protein_g"].max())
print(df.loc[:, "Protein_g"].idxmax()) # "argmax," gives index with biggest value

In fact, you can get the min, max, mean, median, standard deviation, and some percentiles all in one go using the describe() method.

In [None]:
df.describe()

The percentages give the 25th percentile, 50th percentile (median), and 75th percentile, i.e. 25% of the calorie values are 120 or less, and so on.  This gives a nice numerical first pass over the data that helps to give context to individual entries, like knowing 300 calories is a lot for a drink, or 4 grams is below average for a drink's protein.

# DataFrames Day 2

# Correlation

One particularly interesting mathematical function that Pandas makes very accessible is the correlation.  A correlation is a number between -1 and 1 that measures the extent to which two variables covary, with 1 for rising and falling in perfect sync, -1 for one variable always rising as the other falls, and 0 for no (linear) relationship between the variables.  Pandas will show the correlation of every pair of numerical columns in a table with the df.corr() method.  (In Pandas 2.0 and later, corr() raises an error if the table contains text columns unless we pass numeric_only=True, so we do that here.)

In [None]:
df.corr(numeric_only=True) # skip the text columns

As a rule of thumb, a correlation with an absolute value of 0.4 or less can be considered weak, one with an absolute value of 0.6 or more can be considered strong, and correlations in between can be considered moderate.  However, this can depend on what's being studied - if the subjects are unpredictable people in a psychology experiment, a moderate result might be considered strong.

While corr() can compute three kinds of correlation (Pearson, Spearman, and Kendall, selected with the method argument), the default of Pearson is what people usually are referring to when they talk about correlations.
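A small sketch shows how the choice matters.  In the made-up data below, y always rises with x but not in a straight line, so the Pearson correlation (which measures linear relationships) falls a little short of 1, while the Spearman correlation (which only compares rank order) is 1.

```python
import pandas as pd

# Made-up data: y is the square of x, so it rises steadily but not linearly
data = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

pearson = data.corr().loc["x", "y"]                    # the default method
spearman = data.corr(method="spearman").loc["x", "y"]  # rank-based

print(pearson)
print(spearman)
```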

In [None]:
df.describe()

Notice that, for the describe() and corr() tables, we're actually missing some columns that we would consider numerical, like 'Vitamin_A'.  This is where additional data cleaning becomes necessary.

# Checking the column names and types

Real data may be formatted somewhat annoyingly - if the whitespace doesn't match exactly, you may not succeed in naming the column you want.  Examining the .columns field of the dataframe can help you get the proper names for things.  (This file already had its names cleaned up, but did have excess whitespace before.)

In [None]:
df.columns

There may also be columns that seem to be mostly numerical, but a stray string value causes the whole column to be interpreted as strings.  Looking at the dtypes field can help catch these.  In our file here, a "Varies" note under "Caffeine_mg" causes the column to be interpreted as general objects instead of numbers, while a % sign makes Vitamin A and other nutrients read as strings.

In [None]:
df.dtypes

We can fix Vitamin A as an example.  Its problem is that it isn't interpreted as numeric, and therefore isn't getting its describe() or corr() stats computed, because of a percent sign at the end of each value.  If we strip the last character from each string, we are then able to convert each string to a number.

A way to strip the last character from a string named 'string' is to slice it with -1 as the end index, thus string[0:-1].  The -1 refers to the last character of the string, and because a slice stops just before its end index, that last character is dropped.

In [None]:
string = 'string'
string[0:-1]

The str property of a Series allows us to call string-related methods on every string in the Series.  So we can use our character-dropping technique on the whole Vitamin A column.  We can then replace the whole column by assigning our result to the original column.

In [None]:
df['Vitamin_A'] = df['Vitamin_A'].str[0:-1] # Remove the % at the end
df['Vitamin_A']

Last, we can convert the whole column to numerical using pd.to_numeric(), again assigning the whole column result to the column.

In [None]:
df['Vitamin_A'] = pd.to_numeric(df['Vitamin_A'])
df.dtypes

Since the Vitamin A column is now numerical, it now shows in our statistics.

In [None]:
df.describe()
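The "Varies" problem in Caffeine_mg can be handled in a similar spirit.  The sketch below uses made-up values rather than the real columns, and shows two options that go slightly beyond the steps above: str.rstrip("%"), which removes only trailing percent signs, and the errors="coerce" argument of pd.to_numeric(), which converts anything unparseable to NaN instead of raising an error.

```python
import pandas as pd

# Made-up values imitating the two problems described above
vitamin_a = pd.Series(["10%", "0%", "15%"], name="Vitamin_A")
caffeine = pd.Series(["175", "Varies", "75"], name="Caffeine_mg")

# rstrip("%") leaves values without a trailing % untouched
vitamin_a_clean = pd.to_numeric(vitamin_a.str.rstrip("%"))

# errors="coerce" turns text like "Varies" into NaN (missing)
caffeine_clean = pd.to_numeric(caffeine, errors="coerce")

print(vitamin_a_clean)
print(caffeine_clean)
```

Whether a note like "Varies" should become a missing value is a judgment call about the data; coercing simply makes that choice explicit.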

# Drop missing values

Another common issue with data is that it's just not there.  Pandas makes it relatively straightforward to drop rows with missing data, which is represented by NaN ("not a number") in Pandas.  You can detect missing data with isnull(), which returns booleans for every place, and drop relevant rows or columns with dropna().

In [None]:
df.isnull().sum()

In [None]:
df = df.dropna(axis=0, how="any") # Remove the offending row
df.isnull().sum()
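Dropping every row with any missing value can throw away more than we intend.  As a sketch on an invented table, the subset argument of dropna() limits the check to the columns we actually care about.

```python
import numpy as np
import pandas as pd

# Invented table with one hole in each column
table = pd.DataFrame({
    "Calories": [100, np.nan, 250],
    "Protein_g": [3.0, 2.0, np.nan],
})

missing_per_column = table.isnull().sum()

# how="any" drops a row if any of its values is missing
complete_rows = table.dropna(axis=0, how="any")

# subset only looks for missing values in the listed columns
calories_known = table.dropna(subset=["Calories"])

print(missing_per_column)
print(len(complete_rows), len(calories_known))
```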

# Named tuples

It's possible to iterate over the rows of a DataFrame, treating each row as a *named tuple*.  So, before we describe how to iterate like that, what are named tuples?



Named tuples are a little bit more structured than normal tuples. A namedtuple allows us to name each place in the tuple. Then, when accessing that place, we can use the name instead of [0] or [1]. An example follows.

In [None]:
from collections import namedtuple

BillItem = namedtuple("BillItem", ["name", "price"])

item = BillItem("sushi", 10)
print(item.price)

Notice how "item.price" is much more readable than item[1].  A named tuple thus adds a little readability without adding the complexity that would come with making it a full-fledged object.
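A named tuple is still a tuple underneath, which this short sketch illustrates: positions and unpacking keep working, and the fields can't be reassigned.

```python
from collections import namedtuple

BillItem = namedtuple("BillItem", ["name", "price"])
item = BillItem("sushi", 10)

# Index access and unpacking still work, as with any tuple
same_value = item[1] == item.price
name, price = item

# Like other tuples, named tuples are immutable
try:
    item.price = 12
    changed = True
except AttributeError:
    changed = False

print(same_value, name, price, changed)
```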

# itertuples() and iterrows()

We may want access to a whole row record as we iterate down the table.  This is made more convenient with the itertuples() generator.  Placed in a for loop, it formats one row at a time as a named tuple, where a value can be accessed with a dot followed by the column name.

In [None]:
# Find the name of the beverage with the most calories
calorie_max = 0
best_name = ""
for row in df.itertuples():
  if row.Calories > calorie_max:
    calorie_max = row.Calories
    best_name = row.Index      # index is the beverage

print(best_name)


iterrows() does something similar, but it yields each row as a pair of the index label and a Series, so the fields are accessed with strings in square brackets instead of the dot notation.

In [None]:
# Find the name of the beverage with the most calories
calorie_max = 0
best_name = ""
for index, row in df.iterrows():
  if row["Calories"] > calorie_max:
    calorie_max = row['Calories']
    best_name = index

print(best_name)
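For a search this simple, the loop should agree with idxmax() from the section on mathematical functions.  The sketch below checks that on an invented three-drink table (the names and numbers are placeholders, not values from the Starbucks file).

```python
import pandas as pd

# Invented drinks, indexed by name
drinks = pd.DataFrame(
    {"Calories": [5, 510, 320], "Protein_g": [1.0, 15.0, 10.0]},
    index=["Drink A", "Drink B", "Drink C"],
)

# Loop version, as above
calorie_max = 0
best_name = ""
for row in drinks.itertuples():
    if row.Calories > calorie_max:
        calorie_max = row.Calories
        best_name = row.Index

# One-line version using the built-in Series method
best_name_builtin = drinks["Calories"].idxmax()

print(best_name, best_name_builtin)
```

The loops earn their keep when the condition involves several columns at once or needs logic that doesn't fit a single built-in method.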

# Exercise

Using iterrows() or itertuples(), try finding the drink under 200 calories that offers the most protein.

In [None]:
# TODO

# Convenient statistical visualization tools

In addition to convenient mathematical tools for DataFrames, there are some easy functions for visualizing data.

In [None]:
protein = df.loc[:, "Protein_g"]
protein.hist(bins=20) # Create a histogram with 20 equally spaced bins for the data

In [None]:
subplot = df[["Protein_g"]] # Notice another way to get desired column
subplot.boxplot() # Boxplots give median value, middle 50% of data, and range of non-outliers

# Putting it all together

We'll try putting it all together with a new dataset, the Titanic dataset.  This dataset has survival data on all the passengers on the Titanic.  It also has information like the gender of the passenger and how much they paid for a ticket.

We can explore this data with the question, are there variables that tend to predict survival for certain passengers?  Specifically, did gender, age, or class matter in who survived?

The data is in a CSV file, so we can load that up first.

In [None]:
# Skip this cell if not working in Google Colab
from google.colab import files

uploaded = files.upload() # pick titanic.csv

Some relevant variables to us are Survived (1 if they did, 0 if they didn't), Pclass (1st, 2nd, or 3rd class), Sex, and Age.  The variables we'll ignore are SibSp (how many siblings or spouses on the Titanic), Parch (number of parents or children on the Titanic), Ticket number, the fare paid, the cabin number, and the port of embarkation.

In [None]:
import pandas as pd
df = pd.read_csv('titanic.csv', index_col = 'PassengerId')
df.head()

We next make sure there's nothing unexpected about the column names or the types of these columns.  Aside from sex being a string instead of a more convenient number, everything here is as expected.

In [None]:
df.columns

In [None]:
df.dtypes

What is the overall average survival rate?  The average age?  We can use describe() to get a sense of the numeric variables.

In [None]:
df.describe()

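Now, what was correlated with survival?  We can use df.corr() to see, passing numeric_only=True so that text columns such as Sex are skipped (Pandas 2.0 and later raise an error otherwise).

In [None]:
df.corr(numeric_only=True)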

It looks like age doesn't have much relationship with the survived variable, but Pclass has a weak negative correlation with it (third class has lower survival than first).

Sex isn't in the table because it's not numeric.  But we can compute the survival rates separately for men and women.

In [None]:
males = df[df['Sex'] == 'male']
males

In [None]:
males.describe()

In [None]:
females = df[df['Sex'] == 'female']
females.describe()

That's certainly a larger survival rate for women.  Can we compute a correlation?

We can, as long as we create a new column for our table with numeric values instead of strings.  We can create a boolean column of True and False for df['Sex'] == 'female', then add this column to our dataframe by assigning it to a new column name.  The boolean values will be interpreted as 0 and 1 for the correlation.

In [None]:
df['sex_numeric'] = df['Sex'] == 'female'

In [None]:
df.corr(numeric_only=True) # boolean columns count as numeric, so sex_numeric is included

That's a pretty strong correlation, the strongest with Survived in the table.
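To make the True/False-as-1/0 idea concrete, here is a sketch on six invented passengers (not rows from the Titanic file).  We convert the boolean column to integers explicitly with astype(int), which is optional but makes the 0 and 1 values visible, and compare the correlation with the two survival rates.

```python
import pandas as pd

# Six invented passengers
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# True/False column, then an explicit 0/1 version of it
passengers["sex_numeric"] = (passengers["Sex"] == "female").astype(int)

female_rate = passengers[passengers["Sex"] == "female"]["Survived"].mean()
male_rate = passengers[passengers["Sex"] == "male"]["Survived"].mean()
r = passengers["Survived"].corr(passengers["sex_numeric"])

print(female_rate, male_rate, r)
```

A positive correlation here means the same thing as a higher survival rate in the group coded as 1.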

Circling back to the class of cabin, we could visualize the three survival rates using the built-in histogram method.

In [None]:
third_class = df[df['Pclass'] == 3]
second_class = df[df['Pclass'] == 2]
first_class = df[df['Pclass'] == 1]
third_class['Survived'].hist()

In [None]:
second_class['Survived'].hist()

In [None]:
first_class['Survived'].hist()

# More to come

We'll return to all this later - DataFrames have quite a lot that they can do.