# Class 9 Exercises 

These exercises will help you practice the skills and concepts that you learned in today's class.

To get participation credit for today's class, make sure that you work on these exercises and then submit a screenshot or PDF of your work to the appropriate assignment page in Canvas.

___

## Datasheets for Datasets
### Am I The Asshole? Reddit Posts

The dataset that we're working with in this lesson is taken from the subreddit ["Am I The Asshole?"](https://www.reddit.com/r/AmItheAsshole/). It includes only posts that had more than 2,000 upvotes.

___

## Import Pandas

To use the Pandas library, we first need to `import` it.

In [2]:
import pandas as pd

## Change Display Settings

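By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns. Below, we raise the maximum column width to 200 characters so that more of each post's text is visible.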

In [3]:
pd.options.display.max_colwidth = 200
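
As a small illustration of the other display options mentioned above, the sketch below also raises the row and column limits and reads the values back with `pd.get_option`. The specific numbers (100 rows, 50 columns) are arbitrary choices for this example, not values the exercise requires.

```python
import pandas as pd

# Arbitrary example values; pick whatever suits your screen
pd.options.display.max_rows = 100
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 200

# Read the current settings back to confirm the change
current_rows = pd.get_option('display.max_rows')
current_columns = pd.get_option('display.max_columns')
print(current_rows, current_columns)
```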

## Get Data

Read in the data "top-reddit-aita-posts.csv" and save it as the variable `reddit_df`.

In [322]:
reddit_df = pd.read_csv('top-reddit-aita-posts.csv', encoding='utf-8', delimiter=',',
                        #parse_dates=['full_date'] 
                        # We will save this for later!
                       )

Calculate descriptive statistics for all columns in the data

In [13]:
# Your code here

| | author | full_date | date | title | selftext | url | subreddit | upvote_score | num_comments | num_crossposts |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| count | 2932 | 2932 | 2932 | 2932 | 2929 | 2932 | 2932 | 2932.0 | 2932.0 | 2932.0 |
| unique | 2859 | 2932 | 306 | 2932 | 2834 | 2932 | 1 | | | |
| top | [deleted] | 2020-07-24 19:13:49+00:00 | 2020-06-11 | AITA for kicking my cousin off of my sister’s wedding Zoom call? | [removed] | https://www.reddit.com/r/AmItheAsshole/comments/hx80wd/aita_for_kicking_my_cousin_off_of_my_sisters/ | AmItheAsshole | | | |
| freq | 9 | 1 | 27 | 1 | 92 | 1 | 2932 | | | |
| mean | | | | | | | | 7818.791269 | 1350.299795 | 0.171896 |
| std | | | | | | | | 7582.587863 | 1273.432164 | 0.700773 |
| min | | | | | | | | 2001.0 | 1.0 | 0.0 |
| 25% | | | | | | | | 2775.0 | 545.0 | 0.0 |
| 50% | | | | | | | | 4267.5 | 927.0 | 0.0 |
| 75% | | | | | | | | 10704.25 | 1674.75 | 0.0 |


- What are the minimum and maximum upvote scores?
- Who is the most frequent author?
- How many rows are blank in the "selftext" column?

## Rename Column

Rename the column "selftext" as simply "text"

In [None]:
# Your code here

## Fill NA

Fill in the NaN values in the column "text" with the words "No text"

In [289]:
reddit_df['text'] = reddit_df['text'].fillna('No text')

## Filter By String Match

Search for words or phrases in the Reddit data. Can you find any patterns or trends?

In [None]:
word_filter = reddit_df['text'].str.contains('wedding', case=False, na=False)
reddit_df[word_filter]
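
To see what the `case` and `na` arguments in the filter above are doing, here is a minimal sketch on a made-up three-row DataFrame. The name `toy_df` and its contents are hypothetical and stand in for the Reddit data.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the Reddit posts
toy_df = pd.DataFrame({
    'text': ["My WEDDING was ruined", "I ate my roommate's leftovers", None]
})

# case=False ignores capitalization; na=False treats missing text as "no match"
toy_filter = toy_df['text'].str.contains('wedding', case=False, na=False)
print(toy_filter.tolist())

# The Boolean Series can then be used to select only the matching rows
matches = toy_df[toy_filter]
print(len(matches))
```

Without `na=False`, the missing value would stay missing in the filter, which typically causes an error when the filter is used to select rows.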

## Write to CSV

In [None]:
reddit_df[word_filter].to_csv('Filtered-Reddit-Data.csv', index=False)

## Applying Functions

In [314]:
def make_lower(text):
    return text.lower()

In [None]:
reddit_df['text'].apply(make_lower)

In [304]:
def df_make_lower(df):
    return df['text'].lower()

In [None]:
reddit_df.apply(df_make_lower, axis='columns')
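
The two approaches above differ in what the function receives: `Series.apply` passes one cell value at a time, while `DataFrame.apply` with `axis='columns'` passes one whole row at a time. The sketch below shows this on a hypothetical two-row DataFrame (`toy_df` and `row_make_lower` are names invented for this example).

```python
import pandas as pd

# Hypothetical two-row DataFrame used only for this illustration
toy_df = pd.DataFrame({
    'title': ['AITA?', 'WIBTA?'],
    'text': ['Hello THERE', 'Some TEXT'],
})

def make_lower(text):
    # Receives a single cell value (a string)
    return text.lower()

def row_make_lower(row):
    # Receives a whole row (a Series), so we pick the column ourselves
    return row['text'].lower()

from_series = toy_df['text'].apply(make_lower)
from_rows = toy_df.apply(row_make_lower, axis='columns')

# Both approaches produce the same lowercased values
print(from_series.tolist() == from_rows.tolist())
```

The row-wise form is mainly useful when a function needs more than one column from each row.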

## Add a Ratio Column

Getting "ratio'd" in social media parlance usually means that you receive more comments than upvotes (or retweets, etc.). Let's make a function that will calculate a post's "ratio."

In [24]:
def calculate_ratio(df):
    comments = df['num_comments']
    upvotes = df['upvote_score']
    
    return comments / upvotes

Apply the function above to the entire DataFrame below

In [26]:
reddit_df['ratio'] = reddit_df...#Your code here

What might be another way to create this new column?

In [28]:
reddit_df['ratio2'] = reddit_df...#Your code here

Sort the DataFrame from the highest ratio to the lowest

In [None]:
reddit_df...#Your code here

## Converting to Datetime

| **Pandas Data Convert Method** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `pd.to_datetime(df['column_name'], format='%Y-%m')`         | Convert column to datetime, specify date input format                                               |
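
As a rough illustration of the `format` argument in the table, the sketch below parses a few made-up year-month strings. The Series `toy_dates` is hypothetical; the real `full_date` column holds full timestamps and can be parsed without a format string, as the cells below do.

```python
import pandas as pd

# Hypothetical year-month strings, not taken from the Reddit data
toy_dates = pd.Series(['2020-06', '2020-07', '2021-01'])

# %Y is a four-digit year and %m is a zero-padded month
# (uppercase %M would mean minutes instead)
parsed = pd.to_datetime(toy_dates, format='%Y-%m')
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```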

In [None]:
reddit_df['full_date']

In [None]:
pd.to_datetime(reddit_df['full_date'])

In [326]:
reddit_df['full_date'] = pd.to_datetime(reddit_df['full_date'])

We can use `.dt` (datetime) to extract specific parts of the date.

In [None]:
pd.to_datetime(reddit_df['full_date']).dt.date

In [179]:
reddit_df['date'] = pd.to_datetime(reddit_df['full_date']).dt.date

In [None]:
pd.to_datetime(reddit_df['full_date']).dt.time

In [None]:
reddit_df['full_date'].dt.year

In [None]:
reddit_df['full_date'].dt.month

In [None]:
reddit_df['full_date'].dt.second
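
The `.dt` calls above can be tried on a tiny Series first. In this sketch the first timestamp is the one that appears in the descriptive statistics earlier, and the second is made up for illustration.

```python
import pandas as pd

# First value appears in the describe() output above; second is hypothetical
toy_dates = pd.to_datetime(
    pd.Series(['2020-07-24 19:13:49+00:00', '2020-06-11 08:05:00+00:00'])
)

years = toy_dates.dt.year.tolist()      # calendar year of each timestamp
seconds = toy_dates.dt.second.tolist()  # seconds component of each timestamp
dates = toy_dates.dt.date.tolist()      # plain Python date objects, time dropped
print(years, seconds, dates[0])
```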

## Plot By Date

Often it's useful to add a dummy column called "count" with 1s for every row if we're interested in the frequency of values over time.

In [332]:
reddit_df['count'] = 1

In [None]:
reddit_df.groupby('full_date')['count'].sum().plot()

In [None]:
reddit_df.groupby('date')['count'].sum().plot()
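
The difference between the two plots above comes from how many rows share each group key. According to the descriptive statistics earlier, every `full_date` value is unique, so grouping by it gives a count of 1 per group, whereas many posts share the same `date`. A minimal sketch on a hypothetical three-row DataFrame:

```python
import pandas as pd

# Hypothetical posts: two on one day, one on the next
toy_df = pd.DataFrame({'date': ['2020-06-11', '2020-06-11', '2020-06-12']})

# Dummy column of 1s so that summing gives the number of rows per group
toy_df['count'] = 1

posts_per_day = toy_df.groupby('date')['count'].sum()
print(posts_per_day.to_dict())
```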

## Datetime Index

| **Pandas Datetime Index Methods** | **Explanation**                                                                                   |
|:-------------:|:---------------------------------------------------------------------------------------------------:|
| `df.resample('M')`         | Resample, or essentially group by, different spans of time, e.g., `Y`, `M`, `D`, `17min`                                                |
| `df.loc['2018':'2019']`         | Index by label and slice DataFrame between the years 2018 and 2019                             |

Another common approach to working with time series information is to make the datetime column into our Pandas index. There are some special things we can do with a datetime index, such as slice the data by dates more efficiently.

Let's make the "full_date" column our index.

In [329]:
reddit_df = reddit_df.set_index('full_date')

In [None]:
reddit_df

Additionally, because we can use `.loc` to index by date label, we can get only values from 2018.

In [None]:
reddit_df.loc['2018']

We can also slice between dates.

In [None]:
reddit_df.sort_index().loc['2018-05':'2018-10']
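
Here is a small sketch of the same two operations on a hypothetical DataFrame with a datetime index (the dates and scores are made up). The index is sorted first, as in the cell above, because slicing between two date labels is most reliable on a sorted index.

```python
import pandas as pd

# Hypothetical posts, deliberately out of chronological order
toy_df = pd.DataFrame(
    {'upvote_score': [2500, 3100, 4200, 2800]},
    index=pd.to_datetime(['2019-01-15', '2018-05-02', '2018-11-20', '2018-08-09']),
)

sorted_df = toy_df.sort_index()

# A year label selects every row whose timestamp falls in that year
only_2018 = sorted_df.loc['2018']

# A slice between two month labels includes both end months
may_to_october = sorted_df.loc['2018-05':'2018-10']
print(len(only_2018), len(may_to_october))
```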

Another thing we can do with a Datetime Index is to `.resample()`, or essentially group by, different time period spans.

In [None]:
reddit_df.resample('M').sum()

In [None]:
reddit_df.resample('M').sum().plot()

In [None]:
reddit_df.resample('M')['count'].sum().plot()

In [None]:
reddit_df.resample('D')['count'].sum().plot()

In [None]:
reddit_df.resample('Y')['count'].sum().plot()
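
To make the grouping behind `.resample()` more concrete, the sketch below counts hypothetical posts per calendar month. It uses the `'MS'` (month start) alias rather than `'M'`; that choice is an assumption of this sketch, made because `'MS'` is accepted across pandas versions and labels each bin with the first day of the month.

```python
import pandas as pd

# Hypothetical posts: two in May, one in June, none in July, one in August
toy_df = pd.DataFrame(
    {'count': [1, 1, 1, 1]},
    index=pd.to_datetime(['2018-05-02', '2018-05-20', '2018-06-03', '2018-08-09']),
)

# One bin per calendar month between the first and last timestamps
monthly = toy_df.resample('MS')['count'].sum()
print(monthly.tolist())
```

Note that the month with no posts still gets a bin with a sum of 0, which is one way resampling differs from a plain `groupby`.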

In [None]:
word_filter = reddit_df['text'].str.contains('summer', na=False, case=False)

# Plot all Reddit posts over time
reddit_df.resample('D')['count'].sum().plot(kind='area')
# Plot only posts that contain a given word
reddit_df[word_filter].resample('D')['count'].sum().plot(kind='area')