# Data Wrangling with Python

In this workshop, we'll dive deep into some techniques for cleaning and re-shaping data in Python using the [pandas](https://pandas.pydata.org/docs/) library. 

Here's what you can expect to practice:
 - working with CSV data from various sources
 - joining datasets on common elements
 - handling text and time series data
 - grouping and reshaping datasets to create plots
 - working with numeric data at different magnitudes

## Research question

**Do tweets by United States Senators that reference COVID-19 correlate with the incidence of COVID-19 cases in their states?**

## Data sources

 - tweets by Senators in the 116th Congress, collected by GW's [SFM project](https://library.gwu.edu/scholarly-technology-group/social-feed-manager)
 - names, states, and social media handles of current and historical US legislators, compiled by [@unitedstates](https://theunitedstates.io/)
 - daily COVID-19 incidence by U.S. state and territory, compiled by _The New York Times_
 - Federal Information Processing Standards (FIPS) codes for the United States

## Setting up

Let's import any libraries we'll need. 

We'll do most of our work in `pandas`, which should be available automatically in a Google Colab environment. 

In [None]:
import pandas as pd

## Loading our main dataset

Our dataset of tweets by US Senators in the 116th Congress was collected from the public Twitter API, but Twitter's use agreement prohibits sharing the data publicly. So you'll need to load the file that I shared with you from your Google Drive.

In [None]:
# This code will make your Google Drive accessible from your Colab Notebook
# You'll need to click the link provided, copy the token, and paste it into the form.
from google.colab import drive
drive.mount('/content/drive')

### Loading a file from Google Drive

1. Run the cell, follow the link, paste the token in the form provided, and press `Enter`.
2. You should see `Mounted at /content/drive` as the output of that cell. 
3. Now click the Folder icon on the right-hand taskbar of your Colab Notebook. Your mounted Google Drive should be in the folder called `drive`. 
4. If you added the file to your Drive, you can find it under `MyDrive`, which contains all the files and folders in your root Drive folder.
5. Find the file called `data-wrangling-workshop_twitter-data.csv`. 
6. Click to the right of the filename and select `Copy path`.
7. Paste this into a new code cell between quotation marks (as a string) and assign it to a new variable.

In [None]:
path_to_twitter_data = '/content/drive/MyDrive/workshops/Data-Wrangling-101/data-wrangling-workshop_twitter-data.csv'

In [None]:
# With the drive mounted and the path assigned to a variable, we should be able
# to open the file with pandas 
# The read_csv method loads a CSV file into a pandas DataFrame
tweets = pd.read_csv(path_to_twitter_data)

### Exploring & cleaning the data

The Twitter API provides a rich set of metadata as well as the text of each tweet. 

Our `DataFrame` has some handy methods we can use to inspect our dataset.

In [None]:
# By default, a DataFrame displays only the first 5 and last 5 rows
tweets

In [None]:
# We can get a list of columns from the .columns attribute.
# But it's sometimes more helpful to see how many rows are null for each column.
# DataFrame.isna() returns a DataFrame where every null value has been replaced
# by the Boolean value True. 
# All other values have been set to False.
# By calling the .sum() method on **that** DataFrame, we can actually get a COUNT
# of the nulls in each column. 
# Columns with 0 have no nulls.
tweets.isna().sum()
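
To see the `isna().sum()` pattern on something small enough to check by eye, here is a minimal sketch. The `toy` DataFrame and its column values are invented for illustration; they are not taken from the tweets dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweets DataFrame
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['hello', None, 'world'],
    'place': [None, None, 'DC'],
})

# isna() marks each null cell as True; sum() then counts the True values per column
null_counts = toy.isna().sum()
print(null_counts)
```

The `id` column reports 0 nulls, `text` reports 1, and `place` reports 2.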

In [None]:
# We can also see a list of the unique values in a column with the .unique() method.
# Taking the len() of that list gives a count of unique values.
len(tweets.user_screen_name.unique())

There are only 100 Senators, but more screen names occur in our dataset because many Senators use both a personal Twitter account and an official account.

#### Dropping duplicates

Since we're counting tweets, we don't want any duplicates. A DataFrame has a method for checking for those, too.

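`duplicated()` returns `True` for each element that repeats an earlier element in the DataFrame (or the column, if the method is called on a column), and `False` otherwise. By default, the first occurrence of a repeated value is not flagged; passing `keep=False` flags every occurrence.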

For Twitter data, the `id` field should be unique to each tweet.

In [None]:
# The keep=False argument specifies that we return all rows with duplicates.
dupes = tweets.loc[tweets.id.duplicated(keep=False)]

In [None]:
# The duplicates aren't necessarily contiguous, so we sort by the id column to inspect the duplicates.
dupes.sort_values(by='id')

In [None]:
# We can use drop_duplicates to get rid of these duplicates, keeping only the first occurrence.
# The subset keyword argument indicates the column in which to identify duplicate values
tweets = tweets.drop_duplicates(subset='id')
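
The difference between the default behavior of `duplicated()` and `keep=False` is easier to see on a few rows. This is a sketch on made-up ids, not on the workshop data.

```python
import pandas as pd

# Hypothetical ids: 10 and 11 each appear twice
toy = pd.DataFrame({'id': [10, 11, 10, 12, 11],
                    'text': ['a', 'b', 'a', 'c', 'b']})

# Default (keep='first'): only the later repeats are flagged
flag_default = toy.id.duplicated()
# keep=False: every row that belongs to a duplicated group is flagged
flag_all = toy.id.duplicated(keep=False)

# drop_duplicates keeps the first row for each id
deduped = toy.drop_duplicates(subset='id')

print(flag_default.tolist())
print(flag_all.tolist())
print(deduped.id.tolist())
```

Using `keep=False` is handy for inspection, since it shows both members of each duplicated pair; the default is what `drop_duplicates` relies on to decide which rows to discard.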

### Working with time series 

Our `tweets` dataset has a few timestamp columns. The `parsed_created_at` column contains a timestamp designed to be machine-readable. But `pandas` doesn't automatically convert timestamp strings to a `datetime` type, which is the type designed for working with time series.

In [None]:
# The column's dtype tells us the data type.
# The 'O' represents either a string or a column of mixed types.
tweets.parsed_created_at.dtype

In [None]:
# We can convert these parsed strings to datetime objects easily with pandas.
# Let's assign the converted data to a new column
tweets['tweet_date'] = pd.to_datetime(tweets.parsed_created_at)
tweets.tweet_date.dtype

In [None]:
# Now we can see the first and last date in our dataset using the .max() and .min() functions.
(tweets.tweet_date.min(), tweets.tweet_date.max())
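
Here is the same conversion on three hand-written timestamps. The exact string format below is an assumption made for illustration; the real `parsed_created_at` values may be formatted differently.

```python
import pandas as pd

# Hypothetical timestamp strings (the format is assumed, not copied from the dataset)
stamps = pd.Series(['2020-03-01 14:05:00+00:00',
                    '2020-03-01 09:30:00+00:00',
                    '2020-12-31 23:59:59+00:00'])

# Before conversion the values are plain strings
print(stamps.dtype)

# After conversion the column has a timezone-aware datetime64 dtype
converted = pd.to_datetime(stamps)
print(converted.dtype)

# min() and max() now compare points in time rather than strings
earliest, latest = converted.min(), converted.max()
print(earliest, latest)
```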

## Enhancing our data

Our `tweets` dataset has a lot of information about each tweet but not very much information about the authors of the tweets. We don't know which account corresponds to the Senator from a given state. We'll need this additional information in order to compare tweets with COVID incidence. 

### Adding US postal codes to Senators' tweets

Below we'll exploit the fact that we can `merge` DataFrames on common elements. We want to add a column to our dataset indicating the Senator's home state.

The @unitedstates project provides extensive metadata on current and past members of Congress in a variety of machine-readable formats. We'll use their `CSV` files for convenience with `pandas`.

We need two different files:
 - The [legislators-historical](https://theunitedstates.io/congress-legislators/legislators-historical.csv) file contains information on members of the 116th Congress who are **no longer serving**.
 - The [legislators-current](https://theunitedstates.io/congress-legislators/legislators-current.csv) file contains information on those members who are serving in the 117th Congress.

In [None]:
# Below are the urls for each
leg_hist_url = 'https://theunitedstates.io/congress-legislators/legislators-historical.csv'
leg_curr_url = 'https://theunitedstates.io/congress-legislators/legislators-current.csv'

In [None]:
# We load each into a new DataFrame
# read_csv() can read data from a URL as well as a file
leg_hist = pd.read_csv(leg_hist_url)
leg_curr = pd.read_csv(leg_curr_url)

In [None]:
# Since we need data from both, we can concatenate them (stack one on top of the other)
# using pd.concat().
# This works best when the datasets have the same columns
# We can test for that like so, converting the list of columns to a Python set.
set(leg_hist.columns) == set(leg_curr.columns)

In [None]:
# Note that pd.concat() expects its argument to be a Python list.
leg_all = pd.concat([leg_hist, leg_curr])
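
One detail worth knowing about `pd.concat()`: by default each DataFrame keeps its own row labels, so the combined index can contain repeats. The sketch below uses two invented DataFrames (the names and states are placeholders) to show the default and the optional `ignore_index=True` argument, which renumbers the rows.

```python
import pandas as pd

# Hypothetical stand-ins for the historical and current legislator files
hist = pd.DataFrame({'full_name': ['A. Adams', 'B. Brown'], 'state': ['AK', 'AL']})
curr = pd.DataFrame({'full_name': ['C. Clark'], 'state': ['AZ']})

# Default: the original row labels are preserved, so label 0 appears twice
stacked = pd.concat([hist, curr])
print(stacked.index.tolist())

# ignore_index=True builds a fresh 0..n-1 index for the combined rows
renumbered = pd.concat([hist, curr], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels don't affect the Boolean filtering used in this notebook, but they can be surprising if we later select rows by label.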

In [None]:
# We need just a few columns, so let's take a slice of our combined DataFrame.
# The twitter column contains the legislator's Twitter handle.
# The type column indicates whether they are a Senator or Representative.
columns = ['full_name', 'type', 'state', 'twitter']

In [None]:
# We use the .copy() method to avoid warnings from pandas.
# Without .copy(), the slice returns a reference to the original DataFrame.
# Appending .copy() makes a duplicate (a copy in a new location in memory.)
leg_all = leg_all[columns].copy()

In [None]:
# Now let's filter our dataset to keep only Senators
sen_all = leg_all.loc[leg_all.type == 'sen'].copy()

 - We need to enhance the `tweets` dataset with the information from the `state` column of `sen_all`, so that every tweet will be associated with its author's state. 
 - In order to do this computationally, we need a common data element.
 - The best candidate is the Twitter handle or screen name, since it's included in both datasets. Twitter screen names, moreover, are by definition uniquely identifying, fixed strings, like email addresses. 
 - Matching on proper names is much more difficult, since a great deal of variation can occur among representations of a person's name.

We can use pandas indexing, along with the `.isin()` method, to check the Twitter handles in our `tweets` dataset against the @unitedstates dataset.

In [None]:
# Find rows (tweets) where the Twitter screen name does not appear in sen_all's list
filtered = tweets.loc[~tweets.user_screen_name.isin(sen_all.twitter)]
# Count the number of unique screen names in that group
len(filtered.user_screen_name.unique())
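
The `~` operator inverts a Boolean mask, so `~...isin(...)` means "not in the list." A small sketch with invented handles:

```python
import pandas as pd

# Hypothetical screen names; 'PersonalC' is missing from the reference list
toy_tweets = pd.DataFrame({'user_screen_name': ['SenA', 'SenB', 'PersonalC', 'SenA']})
toy_handles = pd.Series(['SenA', 'SenB', 'SenD'], name='twitter')

# True where the screen name appears in the reference list
in_list = toy_tweets.user_screen_name.isin(toy_handles)

# ~ flips the mask, leaving only the rows without a match
unmatched = toy_tweets.loc[~in_list]

# nunique() is a shortcut for len(column.unique())
n_unmatched = unmatched.user_screen_name.nunique()
print(unmatched)
print(n_unmatched)
```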

We're missing the personal accounts of the Senators but capturing, for the most part, their official Senate accounts. In a "real" research project, we'd want either to have a good methodological justification for excluding the personal accounts, or to add the missing data from other sources.


#### Data joins 

We have two datasets:
 - Tweets by Senators in the 116th Congress. 
 - Metadata on each Senator for all Congresses (we are interested in the **states** they represent).

We want to combine these datasets using the Twitter handle/screen name as the common element. The resulting dataset should include:
 - The data from the `tweets` dataset **for every Senator with a matching screen name** in the @unitedstates dataset.
 - The states for those Senators (from the @unitedstates dataset).

It should exclude:
 - Tweets from accounts not listed in the @unitedstates dataset.
 - States for Senators not in the `tweets` dataset.

To create the new dataset, we will use a DataFrame's `merge()` method. The type of merge we need in this case is called `inner`, which is the default.

Above we used `pd.concat()` to combine two datasets by gluing one to the bottom of the other.

`DataFrame.merge()` does something different. It produces a new dataset by splicing together rows from each dataset where the rows share a column element. 

In [None]:
# Here's a small example to help reason about how merge works.
df1 = pd.DataFrame({'keys': ['Raoul', 'Abdul', 'Emily', 'Brett'],
                    'values_from_1': ['Python', 'Python', 'R', 'Java']})
df2 = pd.DataFrame({'keys': ['Raoul', 'Raoul', 'Emily', 'Emily', 'Emily', 'Abdul'],
                    'values_from_2': [.95, .92, .99, .98, 1, .95]})

In [None]:
df1

In [None]:
df2

Let's say `df1` contains some data about students: their names and their programming languages of choice. 

`df2` contains more data about some of those students: their most recent test scores.

 - Our shared elements reside in the `keys` column, since (some) of the same names appear in both datasets. (We assume that each name refers to the same student in both sets, and that each name is unique. Otherwise, our merge will create ambiguity.) The `on` argument to `merge` indicates the name of the column that contains the shared elements.
 - The `values` columns contain the data we want to merge (to associate via the shared elements in `keys`).
 - By performing an `inner` merge, we retain only those rows where there is a match in both sets. As a result, `Brett` drops out of our merged dataset, since no scores are recorded for that name.
 - Where common elements repeat in one of the datasets (as some of the names do in `keys`), the merged elements from the other dataset will repeat, too.

In [None]:
df1.merge(df2, on='keys')
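
To see exactly which rows an inner merge leaves out, we can compare it with a left merge on the same toy data. The `indicator=True` argument adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'keys': ['Raoul', 'Abdul', 'Emily', 'Brett'],
                    'values_from_1': ['Python', 'Python', 'R', 'Java']})
df2 = pd.DataFrame({'keys': ['Raoul', 'Raoul', 'Emily', 'Emily', 'Emily', 'Abdul'],
                    'values_from_2': [.95, .92, .99, .98, 1, .95]})

# Inner merge (the default): only keys present in both DataFrames survive
inner = df1.merge(df2, on='keys')

# Left merge: every row of df1 survives; unmatched rows get NaN from df2
left = df1.merge(df2, on='keys', how='left', indicator=True)

# Rows marked 'left_only' are the ones the inner merge dropped
left_only = left.loc[left['_merge'] == 'left_only', 'keys'].tolist()

print(len(inner), len(left))
print(left_only)
```

The inner merge has six rows (two for Raoul, three for Emily, one for Abdul), while the left merge has seven, the extra one being Brett.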

In [None]:
# We'll use the same approach to merge our tweets dataset to our sen_all dataset.
# Merges can be (computationally) expensive, so it will be faster to remove
# rows and columns we don't need. 
sen_states = sen_all[['twitter', 'state']].copy()

In [None]:
# We'll drop rows for Senators without Twitter accounts (that's most of them, since this is a historical dataset).
sen_states = sen_states.loc[~sen_states.twitter.isnull()].copy()

In [None]:
# Because our shared element -- the Twitter handle -- belongs to two columns with different names,
# we specify each as a parameter to the merge() method.
# left_on refers to the DataFrame whose method we're calling.
# right_on refers to the DataFrame we're passing as an argument to the former.
tweets_states = tweets.merge(sen_states, 
                             left_on='user_screen_name', 
                             right_on='twitter')

In [None]:
# Let's see what fraction of our original data we were able to keep (multiply by 100 for a percentage)
len(tweets_states) / len(tweets)

In [None]:
# It's crucial that we don't end up with any duplicate rows, since that could throw off
# our counts.
# The elements in the id column (from the Twitter dataset) should be unique. 
# We can compare these to the length of our new dataset as a whole.
len(tweets_states) == len(tweets_states.id.unique())
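
As an optional extra safeguard, `merge()` accepts a `validate` argument that raises an error when the keys don't have the expected relationship. The sketch below is built on invented data in which one handle is listed twice on the right-hand side; whether that ever happens in the @unitedstates files is not something this example establishes.

```python
import pandas as pd

# Hypothetical data: 'SenB' appears twice in the right-hand table
toy_tweets = pd.DataFrame({'id': [1, 2, 3],
                           'user_screen_name': ['SenA', 'SenA', 'SenB']})
toy_states = pd.DataFrame({'twitter': ['SenA', 'SenB', 'SenB'],
                           'state': ['AK', 'AL', 'AL']})

# 'many_to_one' asserts that each key is unique in the right-hand DataFrame
try:
    toy_tweets.merge(toy_states, left_on='user_screen_name',
                     right_on='twitter', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)

# After removing the repeated handle, the merge passes validation
clean_states = toy_states.drop_duplicates(subset='twitter')
checked = toy_tweets.merge(clean_states, left_on='user_screen_name',
                           right_on='twitter', validate='many_to_one')
print(len(checked) == len(toy_tweets))
```

This catches the problem at merge time, rather than afterward when we compare lengths.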

#### Applying a test

A `DataFrame` comes with many built-in methods that produce `True/False` values from conditions applied to the elements of a dataset. We've used `isna` and `duplicated` thus far. 

We can also define custom functions. Let's write one to test whether a given tweet mentions the pandemic and then apply it to the text of our tweets.

In [None]:
def is_pandemic(tweet):
  ''':param tweet: a string representation of a single tweet.'''
  # First lowercase the tweet for consistency
  tweet = tweet.lower()
  # Now test for the presence of certain key words
  return ('covid' in tweet) or ('coronavirus' in tweet) or ('pandemic' in tweet)

In [None]:
# We could run the function like this for a single tweet
is_pandemic(tweets_states.iloc[0].text)

In [None]:
# We can use the .apply() method to run our function against every element in the text column.
tweets_states['about_pandemic'] = tweets_states.text.apply(is_pandemic)
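
A possible alternative to `.apply()` here is the vectorized `.str.contains()` method, sketched below on invented tweet text. One practical difference: `is_pandemic` calls `tweet.lower()`, which would raise an error on a null value, while `str.contains` lets us decide how nulls are treated via the `na` argument.

```python
import pandas as pd

# Hypothetical tweet text, including one null value
texts = pd.Series(['Get your COVID-19 test today',
                   'Happy birthday to my colleague',
                   'Coronavirus relief bill passed',
                   None])

# The pattern is a regular expression; | means "or"
pattern = 'covid|coronavirus|pandemic'

# case=False ignores capitalization; na=False treats nulls as non-matches
about_pandemic = texts.str.contains(pattern, case=False, na=False)
print(about_pandemic.tolist())
```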

Finally, let's make our **enhanced** Twitter dataset a little smaller, keeping just the columns we need.

In [None]:
columns = ['state', 'tweet_date', 'user_screen_name', 'about_pandemic']
tweets_states = tweets_states[columns].copy()

In [None]:
tweets_states.to_csv('/content/drive/MyDrive/workshops/Data-Wrangling-101/data-wrangling_twitter-cleaned.csv', index=False)

## Loading our secondary dataset

To compare COVID-related Senatorial tweets and COVID cases by state, we need data about COVID cases. The _New York Times_ provides a clean, concise dataset of cumulative case totals by date and state.

I've modified this dataset to include the postal code for each state, since that's how @unitedstates represents each Senator's state. (The _New York Times_ dataset uses the full state name, along with the FIPS code. I merged this [FIPS dataset](https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_fips_master.csv) with the _New York Times_ dataset in order to add the postal codes.)

In [None]:
cases_url = 'https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/data-wrangling-with-python/nyt_covid19_with-postal-code_020621.csv'
cases = pd.read_csv(cases_url)

In [None]:
# We should also convert the date column to a Python datetime type.
cases.date = pd.to_datetime(cases.date)

## Aggregating by state and date

- The `cases` dataset contains one row per state for each date.
- The `tweets_states` dataset may have multiple rows per state and date, since each state has two Senators, and each Senator might have tweeted multiple times on a given day.
- To compare the two, we want to aggregate them at the same level. 
- Thus, we want to compute how many tweets about COVID occurred on each date for each state.

The `DataFrame`'s `groupby` method is a good fit for this use case.

In [None]:
# We can group by multiple columns by passing the method a list
tweets_grp = tweets_states.groupby(['tweet_date', 'state'])

In [None]:
# We can use the groups property to inspect the groups in a groupby statement
# There's a problem here: it's grouping by the full timestamp, not just the date.
tweets_grp.groups

In [None]:
# Luckily, our datetime column (tweet_date) has some attributes we can use to take only a part of the timestamp
# tweet_date.dt.date will yield the "date" portion, excluding the time
tweets_states.tweet_date = tweets_states.tweet_date.dt.date

In [None]:
tweets_grp = tweets_states.groupby(['tweet_date', 'state'])

In [None]:
# Our groups look better now: each group key is a calendar date, with the time of day removed
tweets_grp.groups
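
It's worth looking at what `.dt.date` actually produces. The sketch below, on invented timestamps, compares it with `.dt.normalize()`, another pandas accessor that sets the time to midnight but keeps the `datetime64` dtype. This is offered as a comparison, not as a change to the approach used in this notebook.

```python
import datetime
import pandas as pd

# Hypothetical timestamps: two on March 1, one on March 2
stamps = pd.to_datetime(pd.Series(['2020-03-01 14:05:00',
                                   '2020-03-01 09:30:00',
                                   '2020-03-02 08:00:00']))

# .dt.date returns plain Python date objects, stored with dtype object
as_dates = stamps.dt.date
print(as_dates.dtype, isinstance(as_dates.iloc[0], datetime.date))

# .dt.normalize() keeps a datetime64 dtype and sets each time to 00:00:00
as_midnight = stamps.dt.normalize()
print(as_midnight.dtype)

# Either way, the three timestamps collapse to two distinct days
print(as_dates.nunique(), as_midnight.nunique())
```

The dtype matters whenever these values are compared or merged with another `datetime64` column, since pandas treats `object` and `datetime64` as different types.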

In [None]:
# To COUNT the number of COVID-related tweets per group, we can apply the .sum() method
# to the about_pandemic column.
tweets_counts = tweets_grp.about_pandemic.sum()

In [None]:
tweets_counts.loc[tweets_counts > 0]

These numbers are relatively small; it's evident that not all Senators were tweeting about COVID every day. 

Our COVID case data from the _New York Times_ shows incidence per state as a cumulative function. We should make our Tweets count cumulative, too.

In [None]:
# The cumsum() method will create a running (cumulative) sum across the rows of a DataFrame or Series.
tweets_counts.cumsum()

In [None]:
# But applying it to the result of our groupby operation does not yield exactly the result we need.
# Our dataset on COVID cases shows cumulative totals BY STATE. 
# The above is showing the cumulative total OVERALL, as we can see by comparing with the original tweets dataset.
tweets_states.about_pandemic.sum()

### Multi-level grouping

We need a cumulative sum **by state**. The numbers to be summed are those indicating relevant tweets per date **within each grouping by state**. 

To do that, we need to group _the result of our `groupby` operation_, and we need to group it by `state`. 

But our result is no longer a `DataFrame`! It's a `Series` (like a single column of a DataFrame), but it has what's called a _hierarchical index_ or a _multi-index_. 

In [None]:
# We can still use groupby, but now we're grouping on a level of the index, not a column
tweets_counts_grp = tweets_counts.groupby('state')

In [None]:
# And we can apply cumsum directly to the result, since the original object being grouped
# is a Series, not a DataFrame
tweets_csum = tweets_counts_grp.cumsum()

In [None]:
tweets_csum

In [None]:
# We can check our result against the original tweets_states dataset
tweets_states.loc[tweets_states.state == 'WV'].about_pandemic.sum()
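
The difference between an overall cumulative sum and a per-state one is easy to verify on a handful of rows. The counts below are invented, and the sketch passes the index level with the explicit `level=` keyword, which is equivalent to passing the level name directly.

```python
import pandas as pd

# Hypothetical daily counts of pandemic-related tweets for two states
toy = pd.DataFrame({
    'tweet_date': ['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02', '2020-03-03'],
    'state': ['NY', 'WV', 'NY', 'WV', 'NY'],
    'about_pandemic': [2, 1, 3, 0, 1],
})

# Summing within each (date, state) group yields a Series with a two-level index
counts = toy.groupby(['tweet_date', 'state']).about_pandemic.sum()

# A plain cumsum runs straight down the Series, mixing the states together
overall = counts.cumsum()

# Grouping on the 'state' level first restarts the running total for each state
by_state = counts.groupby(level='state').cumsum()

print(overall.tolist())
print(by_state.tolist())
```

In the per-state version, New York's running total goes 2, 5, 6 and West Virginia's goes 1, 1, interleaved in date order.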

### Putting it all together

We're now ready to combine our `tweets_csum` dataset, which shows cumulative totals of tweets about the pandemic by state, with our dataset of COVID cases (also cumulative by state). 

The result should be a dataset with a shared time-series axis, arranged by state. 

This time, however, we don't want to do an inner join. We're working with cumulative time-series data, and we don't want to introduce gaps into our dataset if, for instance, there are days when no Senators tweeted anything. 

We'll merge our `cases` dataset with our `tweets_csum` dataset using a **left** join. That means **every row** from the left-hand dataset (`cases`) will be present in the result. Gaps in the right-hand dataset will be represented by `NaN` (null) values.


In [None]:
# This error indicates that one of our two time series columns isn't of a datetime type.
# But didn't we convert them?
cases.merge(tweets_csum, left_on=['date', 'postal_code'],
            right_on=['tweet_date', 'state'],
            how='left')

In [None]:
# The levels of a hierarchical index aren't columns, so we can't use column-style access on them.
# The date elements represent the outermost or first level (zero-indexed) of our index.
# Our conversion with tweet_date.dt.date evidently turned the values into plain Python date
# objects (dtype object), which pandas doesn't treat as a datetime type.
tweets_csum.index.levels[0]

In [None]:
# We can convert them back
date_index = pd.to_datetime(tweets_csum.index.levels[0])
tweets_csum.index = tweets_csum.index.set_levels([date_index,
                            tweets_csum.index.levels[1]])

In [None]:
# The how parameter indicates that this should be a left join
merged = cases.merge(tweets_csum, left_on=['date', 'postal_code'],
            right_on=['tweet_date', 'state'],
            how='left')

In [None]:
# Because it's a left join, the result should have the same number of rows as the left-hand dataset.
len(merged) == len(cases)
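
Here is a left join on a tiny example, so the resulting nulls are easy to spot. All values are invented. This sketch also calls `reset_index()` first, which turns the index levels into ordinary columns; that step is optional and is shown only to make the join keys explicit.

```python
import pandas as pd

# Hypothetical case counts: one state, three consecutive days
toy_cases = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03']),
    'postal_code': ['NY', 'NY', 'NY'],
    'cases': [1, 5, 12],
})

# Hypothetical cumulative tweet counts, with no entry for March 2
idx = pd.MultiIndex.from_arrays(
    [pd.to_datetime(['2020-03-01', '2020-03-03']), ['NY', 'NY']],
    names=['tweet_date', 'state'])
toy_csum = pd.Series([2, 3], index=idx, name='about_pandemic')

# reset_index() converts the two index levels into regular columns
right = toy_csum.reset_index()

toy_merged = toy_cases.merge(right, left_on=['date', 'postal_code'],
                             right_on=['tweet_date', 'state'], how='left')

# Every row of toy_cases is kept; March 2 has NaN in about_pandemic
print(toy_merged[['date', 'cases', 'about_pandemic']])
```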

In [None]:
# Now what about those null values?
merged.loc[merged.about_pandemic.isnull()]

#### The nuisance of nulls

Sometimes `NaN` or null values pose special problems. In other cases, they represent valid data points. 

**Exercise**: How should we handle the nulls in this case, if we want to compare cumulative cases and tweet counts side by side? Should we keep them? Get rid of them? Do something else?

**Answer**

We can actually take two approaches.

1. Nulls are present in the tail of the `about_pandemic` column because the `cases` dataset covers a longer span of time than the `tweets` dataset. (The 116th Congress ended on January 3, 2021; the cutoff used here is January 7, 2021.) We can safely remove these rows, since we don't have the data to compare.

In [None]:
# Let's keep only those rows that fall within the range of dates in our original tweets dataset
# An ISO-formatted date (year-month-day) avoids any ambiguity between month and day
merged = merged.loc[merged.date <= pd.to_datetime('2021-01-07')]

2. Other nulls are present where there are no tweets on a given date in the `tweets` dataset. We can pad these nulls with a forward fill (`ffill()`).

  Padding replaces each null with the last non-null value that precedes it. In a sorted dataset with a cumulative metric, this is a good solution.

  But be careful: our data are cumulative **by state**. So we need to respect the boundaries between the statewise groupings.

In [None]:
# groupby() to the rescue again!
# ffill() ("forward fill") pads each null with the last valid value before it.
# Older pandas code spells this fillna(method='pad'); recent pandas releases deprecate that form.
padded_about_pandemic = merged.groupby('state').about_pandemic.ffill()

In [None]:
# Unlike the sum() and cumsum() methods, this method doesn't yield an aggregation.
# ffill() returns a column aligned with the original, but with the nulls padded.
# The groupby operation ensures that the padding is applied separately for each state.
merged.about_pandemic = padded_about_pandemic
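
To see why the grouping matters when padding, compare a plain forward fill with a per-state one on invented data. The last step, filling any remaining nulls with 0, rests on an assumption: that a null before a state's first recorded tweet means a cumulative count of zero.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts with gaps, sorted by state and then date
toy = pd.DataFrame({
    'state': ['NY', 'NY', 'NY', 'WV', 'WV', 'WV'],
    'about_pandemic': [np.nan, 2, np.nan, np.nan, np.nan, 4],
})

# A plain forward fill carries New York's last value into West Virginia's rows
naive = toy.about_pandemic.ffill()

# Grouping first keeps the padding inside each state
per_state = toy.groupby('state').about_pandemic.ffill()

# Assumption: nulls that remain precede a state's first tweet, so they count as 0
filled = per_state.fillna(0)

print(naive.tolist())
print(filled.tolist())
```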

### Visualizing time series data

All of this effort has been necessary to produce a dataset of two variables that we can actually compare. 

One way to compare them is to plot them both as functions of time. 

Since the totals are cumulative by state, it's straightforward to look at one state at a time.

In [None]:
ny = merged.loc[merged.postal_code == 'NY']

In [None]:
# We can plot multiple variables on the same line graph by passing a list to the "y" parameter
ny.plot(x='date', y=['cases', 'about_pandemic'])

#### Data of different magnitudes

Because the case totals are several orders of magnitude greater than the number of tweets produced by Senatorial accounts, our plot isn't very illuminating. Relevant differences in the cumulative total of tweets will effectively be "smoothed over" as a function of the scale necessary to visualize the other variable.

We can use a [logarithmic scale](https://en.wikipedia.org/wiki/Logarithmic_scale) to compare these variables a little more easily.

In [None]:
# We import matplotlib's ticker module to help with formatting.
import matplotlib.ticker as ticker
# Here we use the Axes object returned by DataFrame.plot() to set some properties.
ax = ny.plot(x='date', y=['cases', 'about_pandemic'])
# We can set the type of scale directly on the Axes object.
ax.set_yscale('log')
# Let's use something more readable than scientific notation (the default)
ax.yaxis.set_major_formatter(ticker.EngFormatter())
# Let's add a title
ax.set_title('COVID-19 cumulative cases vs. US Senate tweets: New York')
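
For intuition about what the logarithmic scale is doing, we can take the base-10 logarithm of a few invented values ourselves. On a log axis, each value is positioned according to its logarithm, so each step up the axis represents multiplication by ten rather than a fixed increment.

```python
import numpy as np
import pandas as pd

# Hypothetical values that differ by several orders of magnitude
toy = pd.DataFrame({'cases': [100, 10_000, 1_000_000],
                    'about_pandemic': [1, 10, 100]})

# log10 turns multiplication by ten into a step of one
logged = np.log10(toy)
print(logged)
```

On the original scale the two columns differ by a factor of up to 10,000; after the transformation, both span a range of only a few units. Note that a value of 0 has no finite logarithm, so zero counts cannot be drawn on a log axis.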

### Exporting our new dataset

If we want to come back to our analysis later, we can export our combined dataset to CSV. 

Since we've already mounted Google Drive in this notebook, if we use our DataFrame's `to_csv` method, we should be able to save directly to our drive.

In [None]:
# The path should start with /content/drive/MyDrive, and then whatever folder(s) you want to put it in.
path_to_merged_data = '/content/drive/MyDrive/workshops/Data-Wrangling-101/data-wrangling-workshop_merged.csv'

In [None]:
# In this case, we set the optional index parameter to False.
# Otherwise, our CSV will include the index, which consists only of row numbers.
# In other cases, we may want to keep the index: if, for instance, we were saving
# our tweets_csum Series to CSV, since its index holds important information.
merged.to_csv(path_to_merged_data, index=False)

## Wrapping up

We've just scratched the surface with our analysis of this dataset, but we have something now with two variables from two separate datasets, aligned along the same time-series axis. 

I hope you've found this lesson useful. Here are some additional resources with techniques for cleaning data in Python.

 - [Pythonic Data Cleaning with pandas and numpy](https://realpython.com/python-data-cleaning-numpy-pandas/)
 - [Data Cleaning in Python](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d) (and similar tutorials on [towards data science](https://towardsdatascience.com/))
 - [Cleaning Data in Python (U. of Toronto Libraries)](https://mdl.library.utoronto.ca/technology/tutorials/cleaning-data-python)
 - [Python Data Cleaning Cookbook (Packt Publishing)](https://wrlc-gwu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma99186142810404107&context=L&vid=01WRLC_GWA:live&lang=en&search_scope=WRLC_P_MyInst_All&adaptor=Local%20Search%20Engine&tab=WRLC&query=any,contains,Python%20Data%20Cleaning%20Cookbook)
 - Many other books and videos available via [O'Reilly Books Online](https://www.safaribooksonline.com/library/view/temporary-access), with content free to GW students, faculty, and staff
