In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Assignment7_COMM187Spring2024.ipynb")

# Coding Assignment #7
## COMM187: Data Science in Communication Research, Spring 2024

**!!! Please make sure to run the first cell before running the auto-grader !!!**

<h3><span style="color:green"> Finish assignment on DataCamp </span> </h3>

If you have not yet signed into this class's DataCamp group, please use this invite link and make an account using your UCSB email ID: https://www.datacamp.com/groups/shared_links/0d032623fe95677c03dd5d41331db87feeb0738725bb7ae390f6d9ee17f2bed8 

This week, you have been assigned the following TWO chapters on DataCamp: 
 - **Hypothesis Testing in Python: Two Sample and ANOVA Testing**
 - **Hypothesis Testing in Python: Proportion Tests**

Finish these assigned chapters before proceeding with this assignment. You will need the skills taught in those chapters to solve this week's coding assignment.

---

## <center>TWEETS BY US SENATORS (2008-17)</center>

![](./imgs/assign_07_introimg.png)

The dataset for this assignment covers the likes, replies, and retweets on tweets made by US Senators between 2008 and 2017, as recorded on Oct. 19 and 20, 2017. This dataset was collected and analyzed for the following [blog post by FiveThirtyEight](https://fivethirtyeight.com/features/the-worst-tweeter-in-politics-isnt-trump/), and you can read more information about the dataset [here](https://github.com/fivethirtyeight/data/tree/master/twitter-ratio/).

The dataset is stored in the data subfolder and is named `senators.csv`. It contains the following columns:


Header | Definition
--- | ---
`created_at` | Date and time of the creation of the tweet in the format "mm/dd/yy hh:mm"
`text` | The text content of the tweet
`url` | URL of the tweet
`replies` | Number of replies for the tweet when data was collected
`retweets` | Number of retweets for the tweet when data was collected
`favorites` | Number of favorites (or likes) for the tweet when data was collected
`user` | Twitter username of the US Senator
`bioguide_id` | Unique ID for each US Senator
`party` | Party identification for each US Senator: 'D' for Democrat, 'R' for Republican, and 'I' for Independent
`state` | The state being represented by the US Senator

---

In [None]:
# RUN THIS CELL BEFORE ATTEMPTING THE ASSIGNMENT QUESTIONS
import numpy as np
import pandas as pd
from scipy import stats

---

**Question 1: Loading Data**

Use the pandas `read_csv` function to load the `senators.csv` file (stored in the data subfolder) into a pandas DataFrame named `df`.

In [None]:
### Write your code below (in place of ...)
df = ...

In [None]:
grader.check("q1")

---

**Question 2: Oldest tweet in the dataset**

Find the oldest tweet in the dataset and save its text in the variable `q2_result`.

Follow these steps: 
 - **Step 1:** Learn how to convert a string to a "date and time" type variable in pandas using the `pd.to_datetime()` function, as described in [this documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html).
 - **Step 2:** Convert the strings in the column `created_at` and store the output in a new column in `df` called `created_at_new`.
 - **Step 3:** Sort `df` in ascending order of `created_at_new` and extract the first row using the `head()` function.
 - **Step 4:** Extract the value of `text` column from this row and store it in a string variable `q2_result`.
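
The parse-then-sort pattern in these steps can be sketched on a toy table. The DataFrame `toy`, its three rows, and the names `oldest_row` and `oldest_text` are made up for illustration and are not part of the assignment data; the explicit `format` string is an assumption based on the "mm/dd/yy hh:mm" layout described above.

```python
import pandas as pd

# Toy data (hypothetical), using the same "mm/dd/yy hh:mm" layout as `created_at`.
toy = pd.DataFrame({
    "created_at": ["10/19/17 21:47", "03/05/09 08:15", "12/31/12 23:59"],
    "text": ["newest", "oldest", "middle"],
})

# Parse the strings into datetimes; an explicit format avoids
# month-first vs. day-first ambiguity.
toy["created_at_new"] = pd.to_datetime(toy["created_at"], format="%m/%d/%y %H:%M")

# Sort by the parsed column (not the raw strings) and keep the first row.
oldest_row = toy.sort_values("created_at_new", ascending=True).head(1)

# Pull the single text value out of that one-row DataFrame.
oldest_text = oldest_row["text"].iloc[0]
print(oldest_text)
```

Sorting the raw strings instead would compare characters, not dates, so "03/05/09" and "10/19/17" would be ordered by month digits rather than by year.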

NOTE: If you simply copy the answer into a string variable without computing the result using the steps outlined above (or another set of equivalent steps), you will get a 0 for this question irrespective of the grade assigned on Gradescope. If you are finding anything challenging, reach out to your classmates on the Discussion Board or your instructors in their office hours :)

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q2")

---

**Question 3: Comparing Favorites on Tweets Across Parties**

**a.** Compare the mean favorites across the three parties (D, R, and I) using ANOVA. Store the F statistic in `q3a_f` and the p-value in `q3a_p`.
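
As a rough illustration of a one-way ANOVA, the sketch below uses `scipy.stats.f_oneway` on a toy table. The column names `group` and `score` and all values are hypothetical, and using `f_oneway` here is an assumption; the tool taught in this week's Coding Lab may differ.

```python
import pandas as pd
from scipy import stats

# Toy data (hypothetical): three groups with three observations each.
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0],
})

# Build one array of values per group.
samples = [grp["score"].to_numpy() for _, grp in toy.groupby("group")]

# f_oneway takes each group's sample as a separate positional argument.
f_stat, p_value = stats.f_oneway(*samples)
print(f_stat, p_value)
```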

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q3a")

**b.** Compare the mean favorites between the two major parties (D and R) using a t-test. Store the t-statistic in `q3b_t` and the p-value in `q3b_p`.
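
A two-sample t-test on independent groups can be sketched with `scipy.stats.ttest_ind`. The arrays `a` and `b` are toy values. Whether the pooled-variance form (the default) or Welch's form (`equal_var=False`) is expected for this question is not stated here, so both are shown as an assumption-laden illustration.

```python
import numpy as np
from scipy import stats

# Toy samples (hypothetical): two independent groups.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

# Default: Student's t-test, which assumes equal variances in both groups.
t_stat, p_value = stats.ttest_ind(a, b)

# Welch's t-test drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(t_stat, p_value)
print(t_welch, p_welch)
```

The sign of the t-statistic depends on the order of the two arguments, so swapping `a` and `b` flips the sign but leaves the two-sided p-value unchanged.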

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q3b")

---

**Question 4: Comparing the most ratioed Democrat and Republican Senator**

As you might be aware, **"ratio"** on social media refers to the ratio of *replies or comments* to *likes* (as defined [here](https://www.howtogeek.com/721651/what-does-ratio-mean-on-social-media/)). On Twitter, a ratio could mean the ratio of replies to either favorites or retweets. For this question, we will define "ratio" as the ratio of replies to favorites.

It has been observed that a high ratio of replies to likes suggests that a post is unpopular or controversial. The reasoning seems to be that if more people choose to reply or comment instead of liking the post, they do so to express their disapproval.

**a.** Make a new column in `df` called `ratio` which is calculated by dividing the value in `replies` by the value in `favorites` for each tweet. Wherever the value of `favorites` is 0, the value of ratio should be `NaN`.

*Hint:* Use `np.nan` to create a `NaN` value.\
*Hint:* Use the function `np.where` to filter through the data and assign a `NaN` value when `favorites` is 0. Read the documentation [here](https://numpy.org/doc/stable/reference/generated/numpy.where.html).
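
A minimal sketch of the `np.where` pattern on toy numbers follows. The DataFrame `toy` and its values are hypothetical; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the middle row has zero favorites.
toy = pd.DataFrame({"replies": [10, 4, 7], "favorites": [5, 0, 14]})

# np.where(condition, value_if_true, value_if_false) works element by element:
# rows with zero favorites get NaN, all other rows get replies / favorites.
toy["ratio"] = np.where(
    toy["favorites"] == 0,
    np.nan,
    toy["replies"] / toy["favorites"],
)
print(toy)
```

Note that `np.where` evaluates both branches for every row, so the division is still computed where `favorites` is 0; pandas yields `inf` for those rows, and `np.where` then discards that value in favor of `NaN`.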

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q4a")

**b.** Find out the MOST ratio'ed tweet in this dataset and store the text of that tweet in a new variable `q4b_result`.

Be sure to search only through rows for which the value of `ratio` is NOT `NaN`. Use the `.notna()` function if needed; see the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html).
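
One way to filter out missing values and then locate a maximum is sketched below on toy data. The rows and the names `valid` and `top_text` are hypothetical, and `idxmax` is only one of several equivalent approaches (sorting in descending order and taking the first row is another).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one row has a missing ratio.
toy = pd.DataFrame({
    "text": ["calm", "unrated", "heated"],
    "ratio": [0.2, np.nan, 3.5],
})

# .notna() returns a boolean mask that is True where a value is present.
valid = toy[toy["ratio"].notna()]

# idxmax() returns the index label of the largest value in the column.
top_text = valid.loc[valid["ratio"].idxmax(), "text"]
print(top_text)
```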

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q4b")

**c.** For the top 20,000 ratio'ed tweets, use a chi-squared test to determine whether the values stored in the `party` column occur with these frequencies:

 - D: 0.45
 - R: 0.45
 - I: 0.10 

These values are stored in a DataFrame called `hypothesized` as shown below.

Follow these steps to solve this question. ALL THESE STEPS ARE MANDATORY: 

 - **Step 1:** Create a new dataframe `df_new` which only contains the top 20,000 tweets with the highest ratios.
 - **Step 2:** Create a new dataframe `observed` which stores the number of times 'D', 'R', and 'I' occur in `df_new`. Use `.value_counts()` and `.to_frame()` for this step.
 - **Step 3:** Create a new column in `observed` called `freq` which stores the *frequency* of 'D', 'R', and 'I' in `df_new`. Compute frequencies by dividing the number of occurrences by 20,000.
 - **Step 4:** Perform a chi-squared test between the observed and hypothesized frequencies using `scipy.stats.chisquare`.
 - **Step 5:** Store Chi-squared statistic in a new variable `q4c_chi2` and the p-value in `q4c_pval`.

NOTE: The *Chi-square goodness of fit* test compares proportions of each level of a categorical variable to hypothesized values. In order to successfully solve this question, please refer to solutions of this week's Coding Lab and please finish your assigned chapters on DataCamp!
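
The goodness-of-fit workflow can be sketched on a toy categorical variable. Everything here (the labels `x`, `y`, `z`, the counts, the proportions, and the names `labels`, `hyp`, and `n`) is hypothetical. This sketch passes counts to `scipy.stats.chisquare`, which is an assumption about usage: the statistic scales with the sample size, so passing proportions instead would give a different value. Refer to the Coding Lab solutions for the form expected in this assignment.

```python
import pandas as pd
from scipy import stats

# Toy categorical data (hypothetical): 100 labels in three categories.
labels = pd.Series(["x"] * 50 + ["y"] * 30 + ["z"] * 20, name="label")
n_total = len(labels)

# Count each category, then convert the counts to proportions.
observed = labels.value_counts().to_frame(name="n")
observed["freq"] = observed["n"] / n_total

# Hypothesized proportions, indexed by category.
hyp = pd.DataFrame(
    {"label": ["x", "y", "z"], "freq": [0.4, 0.4, 0.2]}
).set_index("label")

# chisquare compares values position by position, so put the hypothesized
# rows in the same category order as the observed rows.
hyp = hyp.loc[observed.index]

chi2, pval = stats.chisquare(
    f_obs=observed["n"].to_numpy(),
    f_exp=(hyp["freq"] * n_total).to_numpy(),
)
print(chi2, pval)
```

The alignment step matters because `value_counts()` orders categories by count, which need not match the order in which the hypothesized values were typed.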

In [None]:
### Write your code below (in place of ...)
hypothesized = pd.DataFrame({'party': ['R', 'D', 'I'], 'freq': [0.45, 0.45, 0.10]}).set_index('party')
...

In [None]:
grader.check("q4c")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file. **Please save before exporting!**

Assignment #7 for COMM187: Data Science in Communication Research. Once you have finished answering the questions, first **SAVE**, then download this .ipynb file. Submit the file as instructed on Canvas and Gradescope. **ONLY** submit the .ipynb file, not the zip file.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)