# **CMT309 - Computational Data Science - Data Science Portfolio**

# Part 1 - Text Data Analysis (45 marks)

In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided in _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title` and post text - the `selftext` column-, number of comments - `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

## P1.0) Suggested/Required Imports

In [1]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
from urllib import request
import pandas as pd
module_url = f"https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
#with open("file_1.txt") as f1, open("file_2.txt") as f2
with request.urlopen(module_url) as f, open(module_name,'w') as outf:
  a = f.read()
  outf.write(a.decode('utf-8'))
df = pd.read_csv('data_portfolio_22.csv')
# this fills empty cells with empty strings
df = df.fillna('')

Fetching https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_22.csv


In [3]:
df

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.00,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.00,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.00,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19935,zqrwiel,2020-07-23 16:39:15,11,246,,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,carti why,0,1.00,1883,2014-02-12,0.861626
19936,zqrwiel,2020-12-15 11:25:07,39,1,"Then I think we might get 18 songs, outro usua...",2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,If uzi on track 3 and 16,0,1.00,1883,2014-02-12,0.861626
19937,zqrwiel,2020-12-27 13:57:49,15,1,He has 25songs to perform plus the additional ...,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Man carti’s concerts are gonna be long af,0,1.00,1883,2014-02-12,0.861626
19938,zqrwiel,2020-12-29 12:07:10,6,1,I got goose[***]ps just by thinking about it 😬,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Can’t wait to see Carti going full rage mode o...,0,1.00,1883,2014-02-12,0.861626


## P1.1 - Text data processing (20 marks)

### P1.1.1 - Offensive authors per subreddit (5 marks)

As you will see, the dataset contains a lot of strings of the form `[***]`. These have been used to mask (or remove) swearwords to make it less offensive. We are interested in finding those users that have posted at least one swearword in each subreddit. We do this by counting occurrences of the `[***]` string in the `selftext` column (we can assume that an occurrence of `[***]` equals a swearword in the original dataset).

**What to implement:** A function `offensive_authors(df)` that takes as input the original dataframe and returns a dataframe of the form below, where each row contains authors that posted at least one swearword in the corresponding subreddit.

```
subreddit	author
0	40kLore	Cross_Ange
1	40kLore	DaRandomGitty2
2	40kLore	EMB1981
3	40kLore	Evoxrus_XV
4	40kLore	Grtrshop
...
```

In [4]:
def offensive_authors(df):
  # your answer here
  return df[df['selftext'].str.contains("[***]")][['subreddit','author']]
  

In [5]:
offensive_authors(df)

Unnamed: 0,subreddit,author
44,conspiracy,0naptoon
47,conspiracy,0naptoon
51,conspiracy,10100011a10100011a
76,conspiracy,13followsMe
91,conspiracy,2012ronpaul2012
...,...,...
19841,playboicarti,yoda_[***]
19849,playboicarti,yoda_[***]
19930,playboicarti,zqrwiel
19933,playboicarti,zqrwiel


### P1.1.2 - Most common trigrams per subreddit (15 marks)

We are interested in learning about _the ten most frequent trigrams_ (a [trigram](https://en.wikipedia.org/wiki/Trigram) is a sequence of three consecutive words) in each subreddit's content. You must compute these trigrams on both the `selftext` and `title` columns. Your task is to generate a Python dictionary of the form:

```
{subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram3, freq10)],
subreddit1: [(trigram1, freq1), (trigram2, freq2), ... , (trigram3, freq10)],
...
subreddit63: [(trigram1, freq1), (trigram2, freq2), ... , (trigram3, freq10)],}
```

That is, for each subreddit, the 10 most frequent trigrams and their frequency, stored in a list of tuples. Each trigram will be stored also as a tuple containing 3 strings.

**What to implement**: A function `get_tris(df, stopwords_list, punctuation_list)` that will take as input the original dataframe, a list of stopwords and a list of punctuation signs (e.g., `?` or `!`), and will return a python dictionary with the above format. Your function must implement the following steps in order:

- (**1 mark**) Create a new dataframe called `newdf` with only `subreddit`, `title` and `selftext` columns.
- (**1 mark**) Add a new column to `newdf` called `full_text`, which will contain `title` and `selftext` concatenated with the string `.` (a full stop) followed by a space. That, is `A simple title` and `This is a text body` would be `A simple title. This is a text body`.
- (**1 mark**) Remove all occurrences of the following strings from `full_text`. You must do this without creating a new column:
  - `[***]`
  - `&amp;`
  - `&gt;`
  - `https`
- (**1 mark**) You must also remove all occurrences of at least three consecutive hyphens, for example, you should remove strings like `---`, `----`, `-----`, etc., but not `--` and not `-`.
- (**1 mark**) Tokenize the contents of the `full_text` column after lower casing (removing all capitalization). You should use the `word_tokenize` function in `nltk`. Add the results to a new column called `full_text_tokenized`.
- (**2 mark**) Remove all tokens that are either stopwords or punctuation from `full_text_tokenized` and store the results in a new column called `full_text_tokenized_clean`. _See Note 1_.
- (**2 marks**) Create a new dataframe called `adf` (which will stand for _aggregated dataframe_), which will have one row per subreddit (i.e., 63 rows), and will have two columns: `subreddit` (the subreddit name), and `all_words`, which will be a big list with all the words that belong to that subreddit as extracted from the `full_text_tokenized_clean`.
- (**3 marks**) Obtain trigram counts, which will be stored in a dictionary where each `key` will be a trigram (a `tuple` containing 3 consecutive tokens), and each `value` will be their overall frequency in that subreddit. You are  encouraged to use functions from the `nltk` package, although you can choose any approach to solve this part.
- (**3 marks**) Finally, use the information you have in `adf` for generating the desired dictionary, and return it. _See Note 2_.

Note 1. You can obtain stopwords and punctuation as follows.
- Stopwords: 
```
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
```
- Punctuation:
```
import string
punctuation = list(string.punctuation)
```

Note 2. You do not have to apply an additional ordering when there are several trigrams with the same frequency.

In [6]:
# necessary imports here for extra clarity
from nltk.corpus import stopwords as sw
import string
import warnings
def get_tris(df, stopwords_list, punctuation_list):
  # 1 MARK - create new df with only relevant columns
  newdf = df[['subreddit','title', 'selftext']]
  
  # 1 MARK - concatenate title and selftext
  # 1 MARK for string replacement
  newdf['full_text'] = df.apply(lambda row: f"{row.title}. {row.selftext}", axis=1)
  
  # 1 MARK for regex replacement - remove the strings "[***]", "&amp;", "&gt;" and "https", also at least three consecutive dashes
  prohibitedWords = ['[***]', '&amp;', '&gt;', 'https']
  regex1 = re.compile('|'.join(map(re.escape, prohibitedWords)))
  newdf['full_text'] = newdf.apply(lambda row: regex1.sub("", row.full_text), axis=1)
  newdf['full_text'] = newdf.apply(lambda row: re.sub("/-{3,}/" ,"", row.full_text), axis=1)

  # 1 MARK - lower case, tokenize, and add result to full_text_tokenize
  newdf['full_text_tokenized'] = newdf.apply(lambda row: word_tokenize(row.full_text.lower()), axis=1)
  
  # 2 MARKS - clean the full_text_tokenized column by iterating over each word and discarding if it's either a stopword or punctuation
  newdf['full_text_tokenized_clean'] = newdf.apply(lambda row: [x for x in row.full_text_tokenized if x not in stopwords_list + punctuation_list], axis=1)
  
  # 2 MARKS - create new aggregated dataframe by concatenating all full_text_tokenized_clean values - rename columns as requested
  adf = newdf.groupby('subreddit').agg({'full_text_tokenized_clean': 'sum'})
  
  # 3 MARKS - create new Series object by piping nltk's FreqDist and trigrams functions into all_words
  adf['all_words'] = adf.apply(lambda row: nltk.FreqDist(nltk.trigrams(row.full_text_tokenized_clean)).most_common(10), axis=1)
  
  # 3 MARKS - create output dictionary by zipping subreddit column from adf and tri_counts into a list of tuples, then passing dict()
  # the top 10 most frequent ngrams are obtained by calling sorted() on tri_counts and keeping only the top 10 elements
  out_dict = {}
  for i, row in adf.iterrows():
    out_dict[i] = row['all_words']

  return out_dict
  

In [7]:
# get stopwords as list
sw = sw.words('english')
# get punctuation as list
p = list(string.punctuation)
# optional lines for adding the below line to avoid the SettingWithCopyWarning
warnings.filterwarnings('ignore')
get_tris(df, sw, p)

{'40kLore': [(('--', '--', '--'), 145),
  (('whose', 'bolter', 'anyway'), 8),
  (('started', 'us', 'examples'), 8),
  (('space', 'marine', 'chapter'), 7),
  (('kabal', 'black', 'heart'), 7),
  (('lo', '’', 'tos'), 7),
  (('die', 'paragon', 'knights'), 6),
  (('dark', 'age', 'technology'), 4),
  (('let', "'s", 'say'), 4),
  (('``', 'star', 'claimers'), 4)],
 'AMD_Stock': [(('created', 'subreddit', 'reddit'), 10),
  (('subreddit', 'reddit', 'posts'), 10),
  (('reddit', 'posts', 'r/radeongpus'), 10),
  (('open', 'redditors', 'posts'), 10),
  (('redditors', 'posts', 'well'), 10),
  (('posts', 'well', 'please'), 10),
  (('well', 'please', 'consider'), 10),
  (('please', 'consider', 'subscribing'), 10),
  (('consider', 'subscribing', 'find'), 10),
  (('subscribing', 'find', 'posts'), 10)],
 'Anki': [(('``', 'conditional', "''"), 7),
  (('``', 'field2', "''"), 5),
  (('\\^conditional', 'field1', '/conditional'), 5),
  (('conditional', "''", 'filled'), 4),
  (('font-family', 'simplified', 'ara

## P1.2 - Answering questions with pandas (15 marks)

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Authors that post highly commented posts (3 marks)

Find the top 1000 most commented posts. Then, obtain the names of the authors that have at least 3 posts among these posts.

**What to implement:** Implement a function `find_popular_authors(df)` that takes as input the original dataframe and returns a list strings, where each string is the name of authors that satisfy the above criteria.

In [8]:
def find_popular_authors(df):
  # your answer here
  sorted_df = df.sort_values(by=['num_comments'], ascending=False).head(1000)
  out_list = []
  for i, row in sorted_df.iterrows():
    if row['subr_numb_posts'] >= 3:
      out_list.append(row['author'])  
  return list(set(out_list))

In [9]:
find_popular_authors(df)

['11_gop',
 'ObnoxiousOldBastard',
 'vanish619',
 'ghostmeharder',
 'Gari_305',
 'dsbwayne',
 'kapetankuka',
 'VeganSamura1',
 'bear-rah',
 'kanye5150',
 'lexytheblasian',
 'ryanaire5',
 'WorkTomorrow',
 'bobby_triple',
 'manmeet10',
 'Cicada200',
 'Przemek0980',
 'EyeWikeWocketz',
 'mliash',
 'r[***]og',
 'AlitaBattlePringleTM',
 'ArtistDecor',
 'Sleegan',
 'DaFunkJunkie',
 'AnakinWayneII',
 'mostrandomguy',
 'ZOEXofficial',
 'stealthyfrog',
 'bgny',
 'Romano16',
 'Stoaticor',
 'bored_in_NE',
 'ostrichesarenotreal',
 'factfind',
 'BebeFanMasterJ',
 'IronWolve',
 'InternetCaesar',
 'darkoms666',
 'skuzgang',
 'Molire',
 '[***]reader',
 'buoninachos',
 'seanspeaks77',
 'mepper',
 'nycsellit4me',
 'esberat',
 'genericwan',
 'dadboddadjokes',
 'Skullzrulerz',
 'seamslegit',
 'Cannibaloxfords10',
 'MysteriiousComposer',
 'samarai4444',
 'kadirba68',
 'gameskull11',
 'strngerdngermaus',
 'signed7',
 'Humble_Award_4873',
 'HugeDetective0',
 'alwayswashere',
 'callmebaiken',
 'Ch33rn0',
 'jbl

### P1.2.2 - Distribution of posts per weekday (5 marks)

Find the percentage of posts that were posted in each weekday (Monday, Tuesday, etc.). You can use an external calendar or you can use any functionality for dealing with dates available in pandas. 

**What to implement:** A function `get_weekday_post_distribution(df)` that takes as input the original dataframe and returns a dictionary of the form (the values are made up):

```
{'Monday': '14%',
'Tuesday': '23%', 
...
}
```

Note that you must only return two decimals, and you must include the percentage sign in the output dictionary. 

Note that in dictionaries order is not preserved, so the order in which it gets printed will not matter. 

In [10]:
def get_weekday_post_distribution(df):
  # your answer here
  date_to_weekdays = []
  for i, row in df.iterrows():
    date_to_weekdays.append(datetime.strptime(row['posted_at'], '%Y-%m-%d %H:%M:%S').strftime('%A'))

  
  total_number_of_rows = df.shape[0]
  return {
      "Monday":f"{int(round(date_to_weekdays.count('Monday')/total_number_of_rows,2)*100)}%",
      "Tuesday":f"{int(round(date_to_weekdays.count('Tuesday')/total_number_of_rows,2)*100)}%",
      "Wednesday":f"{int(round(date_to_weekdays.count('Wednesday')/total_number_of_rows,2)*100)}%",
      "Thursday":f"{int(round(date_to_weekdays.count('Thursday')/total_number_of_rows,2)*100)}%",
      "Friday":f"{int(round(date_to_weekdays.count('Friday')/total_number_of_rows,2)*100)}%",
      "Saturday":f"{int(round(date_to_weekdays.count('Saturday')/total_number_of_rows,2)*100)}%",
      "Sunday":f"{int(round(date_to_weekdays.count('Sunday')/total_number_of_rows,2)*100)}%"
  }

In [11]:
get_weekday_post_distribution(df)

{'Monday': '14%',
 'Tuesday': '15%',
 'Wednesday': '15%',
 'Thursday': '15%',
 'Friday': '15%',
 'Saturday': '14%',
 'Sunday': '13%'}

### P1.2.3 - The 100 most passionate redditors (7 marks)

We would like to know which are the 100 redditors (`author` column) that are most passionate. We will measure this by checking, for each redditor, the ratio at which they use adjectives. This ratio will be computed by dividing number of adjectives by the total number of words each redditor used. The analysis will only consider redditors that have written at least 1000 words.

**What to implement:** A function called `get_passionate_redditors(df)` that takes as input the original dataframe and returns a list of the top 100 redditors (authors) by the ratio at which they use adjectives considering both the `title` and `selftext` columns. The returned list should be a list of tuples, where each inner tuple has two elements: the redditor (author) name, and the ratio of adjectives they used. The returned list should be sorted by adjective ratio in descending order (highest first). Only redditors that wrote more than 1000 words should be considered. You should use `nltk`'s `word_tokenize` and `pos_tag` functions to tokenize and find adjectives. You do not need to do any preprocessing like stopword removal, lemmatization or stemming.

In [12]:
def ratio_cal(row):
  wordsList = nltk.word_tokenize(row['full_text'])
  tagged = nltk.pos_tag(wordsList)
  adj_count = 0
  for t in tagged:
    if t[1] in ['JJ', 'JJR','JJS']:
      adj_count += 1
  return adj_count/row['full_text_len']

def get_passionate_redditors(df):
  # your answer here
  newdf = df.copy()
  newdf['full_text'] = newdf.apply(lambda row: f"{row.title}. {row.selftext}", axis=1)
  newdf['full_text_len'] = newdf.apply(lambda row: len(row.full_text), axis=1)
  newdf = newdf[['author', 'full_text', 'full_text_len']]
  newdf = newdf.loc[newdf['full_text_len'] >= 1000]
  newdf['ratio'] = newdf.apply(lambda row: ratio_cal(row), axis=1)
  sorted_df = newdf.sort_values(by=['ratio'], ascending=False)[['author', 'ratio']].head(100)
  return_list = sorted_df.apply(lambda row: (row.author, row.ratio), axis=1).tolist()

  return return_list

In [13]:
get_passionate_redditors(df)

[('healrstreettalk', 0.04265791632485644),
 ('healrstreettalk', 0.03968253968253968),
 ('healrstreettalk', 0.03634957463263728),
 ('healrstreettalk', 0.03313696612665685),
 ('healrstreettalk', 0.032981530343007916),
 ('healrstreettalk', 0.03289473684210526),
 ('OhanianIsTheBest', 0.03055100927441353),
 ('healrstreettalk', 0.030053034767236298),
 ('CommonEmployment2', 0.02894593118514473),
 ('healrstreettalk', 0.028481012658227847),
 ('healrstreettalk', 0.028101802757158005),
 ('nyello-2000', 0.027966742252456538),
 ('bionista', 0.027938342967244702),
 ('OhanianIsTheBest', 0.027591973244147156),
 ('ArrancarIsaoMizota', 0.027586206896551724),
 ('Kinmuan', 0.0273190621814475),
 ('lilshawnyy420', 0.02670971529204579),
 ('seamslegit', 0.025906735751295335),
 ('I_am_a_[***]_to_ants', 0.02555066079295154),
 ('nyello-2000', 0.02541436464088398),
 ('OhanianIsTheBest', 0.024954510007798286),
 ('Tripmooney', 0.024944974321349962),
 ('Tripmooney', 0.02456306093528578),
 ('Zendexor', 0.024369747899

## P1.3 Ethics (10 marks)

Imagine you are **the head of a data mining company** that needs to use the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as conspiracy or not conspiracy (for example, for hiding potentially harmful tweets or Facebook posts). Some
information about the project and the team:

- Your client is a political party concerned about misinformation.
- The project requires mining Facebook, Reddit and Instagram data.
- The team consists of Joe, an American mathematician who just finished college; Fei, a senior software engineer from China; and Francisco, a data scientist from Spain.

Reflect on the impact of exploiting data science for such an application. You should map your discussion to one of the five actions outlined in the UK’s Data Ethics Framework.

Your answer should address the following:
- Identify the action in which your project is the weakest.
- Then, justify your choice by critically analyzing the three key principles for that action outlined
in the Framework, namely transparency, accountability and fairness.
- Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case.

Your answer should be between 500 and 700 words. **You are strongly encouraged to follow a scholarly approach, e.g., with references to peer reviewed publications. References do not count towards the word limit**.

---

Your answer here