# COMP 370 Homework 1 – Mini Data Science Project

## 1. Data Collection

a. Download the raw tweet data. You will ONLY be using the data from the first file.

In [1]:
import pandas as pd
df = pd.read_csv('/Users/oliviapereira/Desktop/IRAhandle_tweets_1.csv')

b. Looking at only the first 10,000 tweets in the file, keep those that (1) are in English and (2) don’t contain a question. This will be our dataset. To filter the right tweets out, take a look at the columns.

i. There are specific columns that call out our language. You can trust these.

ii. Assume that a tweet which contains a question contains a “?” character.

In [2]:
df_10k = df.iloc[:10000]
print('shape after cutting to 10k:', df_10k.shape[0])
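# regex=False matches a literal '?' and avoids the invalid '\?' escape in a non-raw string
df_filtered = df_10k[(df_10k['language'] == 'English') & (~df_10k['content'].str.contains('?', regex=False))]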
print('shape after filtering:', df_filtered.shape[0])

shape after cutting to 10k: 10000
shape after filtering: 4802
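
As a quick sanity check of the two filter conditions, the sketch below applies the same kind of mask to a tiny hand-made frame. The toy rows are invented for illustration; only the column names `language` and `content` come from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match the real data.
toy = pd.DataFrame({
    'language': ['English', 'English', 'Russian', 'English'],
    'content': ['Good morning', 'Is this real?', 'Привет', 'No question here'],
})

# Keep English tweets whose text has no literal '?' (regex=False avoids escaping).
is_english = toy['language'] == 'English'
has_question = toy['content'].str.contains('?', regex=False)
toy_filtered = toy[is_english & ~has_question]

print(toy_filtered['content'].tolist())
```

Naming the two masks separately makes it easier to inspect how many rows each condition removes on its own.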


c. Create a new file (I would suggest TSV, i.e. tab-separated values, format) containing these tweets.

In [3]:
df_filtered.to_csv('/Users/oliviapereira/Desktop/filtered_tweets.tsv', sep='\t', index=False)

## 2. Data Annotation

a. To do our analysis, we need to add one new feature: whether or not the tweet mentioned Trump. This feature “trump_mention” is Boolean (“T”/“F”). A tweet mentions Trump if and only if it contains the word “Trump” (case-sensitive) as a word. This means that it is separated from other alphanumeric characters by either whitespace OR non-alphanumeric characters (e.g., “anti-Trump protesters” contains “Trump”, but “I got trumped” does not).

In [7]:
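df_annotated = df_filtered.copy()        # independent copy, so assignments below don't touch df_filtered
df_annotated['trump_mentioned'] = 'F'    # set the default value to false

# setting the value to true if the content contains the substring 'Trump'
# NOTE: this is a plain case-sensitive substring match, so it also flags words
# such as 'Trumpet'; the whole-word rule in the task statement is stricter.
df_annotated.loc[df_annotated['content'].str.contains('Trump'), 'trump_mentioned'] = 'T'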
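
The whole-word rule from the task statement could be checked with a regular expression along these lines. This is a minimal sketch, not the method used for the numbers reported in this notebook: the pattern, the helper name `mentions_trump`, and the choice to treat only ASCII letters and digits as alphanumeric are all assumptions.

```python
import re
import pandas as pd

# Hypothetical pattern: "Trump" not directly adjacent to an ASCII letter or digit.
TRUMP_WORD = re.compile(r'(?<![A-Za-z0-9])Trump(?![A-Za-z0-9])')

def mentions_trump(text: str) -> str:
    """Return 'T' if text contains 'Trump' as a standalone word, else 'F'."""
    return 'T' if TRUMP_WORD.search(text) else 'F'

# Made-up example strings covering the edge cases in the task statement.
examples = pd.Series([
    'anti-Trump protesters',
    'I got trumped',
    'Trumpet solo tonight',
    '#Trump2016',
    'Trump.',
])
labels = examples.map(mentions_trump)
print(labels.tolist())  # ['T', 'F', 'F', 'F', 'T']
```

Explicit lookarounds are used instead of `\b` because `\b` treats the underscore as a word character, which would not match the "non-alphanumeric" wording of the rule.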

b. Create a new version of your dataset (e.g., a CSV/TSV file) that contains this additional feature.

In [8]:
df_annotated.to_csv('/Users/oliviapereira/Desktop/annotated_tweets.tsv', sep='\t', index=False)

## 3. Analysis

a. Using your newly annotated dataset, compute the statistic: % of tweets that mention Trump.

In [16]:
num_mention_trump = (df_annotated['trump_mentioned'] == 'T').sum()
print('number of tweets mentioning Trump:', num_mention_trump)

# computing the statistic
percentage_mention_trump = num_mention_trump / df_annotated.shape[0] * 100
print(f'percentage of tweets mentioning Trump: {percentage_mention_trump:.6f}%.')

number of tweets mentioning Trump: 190
percentage of tweets mentioning Trump: 3.956685%.


b. It turns out that our approach isn’t counting tweets properly ... meaning that some tweets are getting counted more than once. Go through and look at your annotated data. Identify where the counting problem is coming from.

In [20]:
with open('README.md', 'w') as f:
    f.write('There are instances of identical retweets or quote tweets that are being counted multiple times. There also seem to be cases where the same tweet was entered multiple times, possibly reuploaded after a slight correction.')
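
One way to look for such repeats is to flag rows whose text is identical after light normalization. The sketch below runs on invented toy rows; collapsing whitespace is only one assumed notion of a "slight correction", and the real data may need a looser comparison.

```python
import pandas as pd

# Toy rows invented for illustration; `content` is the only real column name used.
toy = pd.DataFrame({
    'content': [
        'Trump rally tonight',
        'Trump rally tonight',      # exact repeat
        'Trump  rally tonight ',    # same text with stray whitespace
        'Unrelated tweet',
    ],
})

# Collapse runs of whitespace so trivially different copies compare equal.
normalized = toy['content'].str.split().str.join(' ')

# keep=False flags every member of a duplicate group, not just the later ones.
dup_mask = normalized.duplicated(keep=False)
print(dup_mask.tolist())
print(normalized[dup_mask].value_counts())
```

Counting `normalized.nunique()` instead of the number of rows would give a de-duplicated denominator for the statistic.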

## Submission Instructions
To be considered complete, your submission should contain the following and some non-trivial attempt to provide a solution.
- README.md
    - In 3 sentences or less, explain where the counting problem is coming from.
- dataset.tsv
    - This should be the output of your Data Annotation phase.
    - Format is tab-separated value, utf-8 (as long as you don’t do anything fancy, it will be in utf-8)
    - The first line should be a header line
    - The file should contain the following columns, in this order: tweet_id, publish_date, content, and trump_mention. Tweets should appear in the same order they appeared in the original file from 538.
- results.tsv
    - Format is tab-separated value
    - The first line should be a header line, with headers “result” and “value”.
    - The second line should contain the result for “frac-trump-mentions”. If necessary, truncate your answer to three decimal places.

In [22]:
# creating final files to be submitted

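# select the required columns in order and rename to the required header 'trump_mention'
df_dataset = df_annotated[['tweet_id', 'publish_date', 'content', 'trump_mentioned']].rename(
    columns={'trump_mentioned': 'trump_mention'})
df_dataset.to_csv('/Users/oliviapereira/Desktop/dataset.tsv', sep='\t', index=False)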

results = {
    'result' : 'frac-trump-mentions',
    # truncate (rather than round) to three decimal places, as the instructions ask
    'value' : int(percentage_mention_trump * 1000) / 1000
}

df_results = pd.DataFrame([results])
df_results.to_csv('/Users/oliviapereira/Desktop/results.tsv', sep='\t', index=False)