
In this question you will write Python code for processing, analyzing and understanding the social network **Reddit** (www.reddit.com). Reddit is a platform that allows users to upload posts and comment on them, and is divided in _subreddits_, often covering specific themes or areas of interest (for example, [world news](https://www.reddit.com/r/worldnews/), [ukpolitics](https://www.reddit.com/r/ukpolitics/) or [nintendo](https://www.reddit.com/r/nintendo)). You are provided with a subset of Reddit with posts from Covid-related subreddits (e.g., _CoronavirusUK_ or _NoNewNormal_), as well as randomly selected subreddits (e.g., _donaldtrump_ or _razer_).

The `csv` dataset you are provided contains one row per post, and has information about three entities: **posts**, **users** and **subreddits**. The column names are self-explanatory: columns starting with the prefix `user_` describe users, those starting with the prefix `subr_` describe subreddits, the `subreddit` column is the subreddit name, and the rest of the columns are post attributes (`author`, `posted_at`, `title` and post text - the `selftext` column-, number of comments - `num_comments`, `score`, etc.).

In this exercise, you are asked to perform a number of operations to gain insights from the data.

In [227]:
# suggested imports
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()
from ast import literal_eval
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Prasad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Prasad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Prasad\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [228]:
import pandas as pd

# Define URL and filename
module_url = "https://media.githubusercontent.com/media/#/Test_Datasets/main/ML_Test_1.csv"

# Read the CSV directly from the URL into a pandas DataFrame
df = pd.read_csv(module_url)

# This fills empty cells with empty strings
# df = df.fillna('')

df.shape

(19940, 17)

In [229]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599


## P1.1 - Text data processing

### P1.1.1 - Faved by as lists

The column `subr_faved_by` contains an array of values (names of redditors who added the subreddit to which the current post was submitted), but unfortunately they are in text format, and you would not be able to process them properly without converting them to a suitable python type. You must convert these string values to Python lists, going from

```python
'["user1", "user2" ... ]'
```

to

```python
["user1", "user2" ... ]
```

**What to implement:** Implement a function `transform_faves(df)` which takes as input the original dataframe and returns the same dataframe, but with one additional column called `subr_faved_by_as_list`, where you have the same information as in `subr_faved_by`, but as a python list instead of a string.

In [232]:
eval('2+9*66')

596

In [233]:
import ast

df['subr_faved_by_as_list']=df['subr_faved_by'].apply(lambda x : ast.literal_eval(x))

In [234]:
df.head()

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.0,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo..."


In [235]:
df['subr_faved_by_as_list']

0        [vergil_never_cry, Jelegend, pianoyeah, salomo...
1        [vergil_never_cry, Jelegend, pianoyeah, salomo...
2        [vergil_never_cry, Jelegend, pianoyeah, salomo...
3        [vergil_never_cry, Jelegend, pianoyeah, salomo...
4        [vergil_never_cry, Jelegend, pianoyeah, salomo...
                               ...                        
19935    [solex125, redreddington22, HibikiSS, klondipe...
19936    [solex125, redreddington22, HibikiSS, klondipe...
19937    [solex125, redreddington22, HibikiSS, klondipe...
19938    [solex125, redreddington22, HibikiSS, klondipe...
19939    [solex125, redreddington22, HibikiSS, klondipe...
Name: subr_faved_by_as_list, Length: 19940, dtype: object

### P1.1.2 - Merge titles and text bodies

All Reddit posts need to have a title, but a text body is optional. However, we want to be able to access all free text information for each post without having to look at two columns every time.

**What to implement**: A function `concat(df)` that will take as input the original dataframe and will return it with an additional column called `full_text`, which will concatenate `title` and `selftext` columns, but with the following restrictions:

- 1) Wrap the title between `<title>` and `</title>` tags.
- 2) Add a new line (`\n`) between title and selftext, but only in cases where you have both values (see instruction 4).
- 3) Wrap the selftext between `<selftext>` and `</selftext>`.
- 4) You **must not** include the tags in points (1) or (3) if the values for these columns is missing. We will consider a missing value either an empty value (empty string) or a string of only one character (e.g., an emoji). Also, the value of a `full_text` column must not end in the new line character.

In [237]:
# def concat(df):
#     # your code here
#     df.fillna(np.nan,inplace=True) # filling up missing values with nan 
#     title=df['title'].values.ravel() # creating a 1D a... 
#     (df['title'].isna()) |  (df['selftext'].isna())
#     df['full_text'] = df['title'])+df['selftext']
#     # df['full_text'] = '<title>' + df['title'] + '</title>' + "\n" + '<selftext>' + df['selftext'] + '</selftext>'
#     # return df

# df = concat(df)


In [238]:
df['selftext'].nunique

<bound method IndexOpsMixin.nunique of 0                                                      NaN
1                                                      NaN
2                                                      NaN
3                                                      NaN
4                                                      NaN
                               ...                        
19935                                                  NaN
19936    Then I think we might get 18 songs, outro usua...
19937    He has 25songs to perform plus the additional ...
19938       I got goose[***]ps just by thinking about it 😬
19939                                                  NaN
Name: selftext, Length: 19940, dtype: object>

In [239]:
df['title'].nunique

<bound method IndexOpsMixin.nunique of 0        BREAKING: Trump to begin hiding in mailboxes t...
1                                      Joe Biden's America
2        4 more years and we can erase his legacy for g...
3        Revelation 9:6 [Transhumanism: The New Religio...
4                                           LOOK HERE, FAT
                               ...                        
19935                                            carti why
19936                             If uzi on track 3 and 16
19937            Man carti’s concerts are gonna be long af
19938    Can’t wait to see Carti going full rage mode o...
19939             [OFFTOPIC] have you seen that new LV? 😂💀
Name: title, Length: 19940, dtype: object>

In [240]:
# for -
# if (df['selftext'].isnull().sum()!=0) & (df['title'].isnull()!= 0) :
#     df['full_text'] = ('<title>' + df['title'] + '</title>') + '\n' + ('<selftext>' + df['selftext'] + '</selftext>')
# # else if ((df['selftext'].null() == 0) & (df['title'].null() == 0))


In [241]:
# df['selftext'].isnull()  !=0

In [242]:
# df['a'] = '<title>' + df['title'] + '</title>' +'\n' + ('<selftext>' + df['selftext'] + '</selftext>')

In [243]:
# df['a']

In [244]:
import pandas as pd
import numpy as np

def concat(df):
    """
    Concatenates 'title' and 'selftext' columns into a new 'full_text' column.
    """
    # Create a copy to avoid modifying the original DataFrame
    df_copy = df.copy()

    # Define a helper function to check for "missing" values
    def is_missing(text):
        if pd.isna(text):
            return True
        text_stripped = str(text).strip()
        return not text_stripped or len(text_stripped) <= 1

    # Create the full_text column by applying a function row-wise
    df_copy['full_text'] = df_copy.apply(lambda row: create_full_text(row), axis=1)
    
    return df_copy

def create_full_text(row):
    """
    Helper function to create the full_text string for a single row.
    """
    title = row['title']
    selftext = row['selftext']

    title_exists = not is_missing(title)
    selftext_exists = not is_missing(selftext)

    full_text_parts = []
    if title_exists:
        full_text_parts.append(f'<title>{title}</title>')
    
    if selftext_exists:
        if title_exists:
            full_text_parts.append('\n')
        full_text_parts.append(f'<selftext>{selftext}</selftext>')
    
    return ''.join(full_text_parts)

def is_missing(text):
    """
    Checks if a string is considered 'missing' based on the rules.
    """
    if pd.isna(text):
        return True
    text_stripped = str(text).strip()
    return not text_stripped or len(text_stripped) <= 1

In [245]:
df = concat(df)


In [246]:
df

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list,full_text
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>BREAKING: Trump to begin hiding in mail...
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Joe Biden's America</title>
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>4 more years and we can erase his legac...
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Revelation 9:6 [Transhumanism: The New ...
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...","<title>LOOK HERE, FAT</title>"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19935,zqrwiel,2020-07-23 16:39:15,11,246,,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,carti why,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>carti why</title>
19936,zqrwiel,2020-12-15 11:25:07,39,1,"Then I think we might get 18 songs, outro usua...",2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,If uzi on track 3 and 16,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>If uzi on track 3 and 16</title>\n<self...
19937,zqrwiel,2020-12-27 13:57:49,15,1,He has 25songs to perform plus the additional ...,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Man carti’s concerts are gonna be long af,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>Man carti’s concerts are gonna be long ...
19938,zqrwiel,2020-12-29 12:07:10,6,1,I got goose[***]ps just by thinking about it 😬,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Can’t wait to see Carti going full rage mode o...,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>Can’t wait to see Carti going full rage...


### P1.1.3 - Enrich posts

We would like to augment our text data with linguistic information. To this end, we will _tokenize_, apply _part-of-speech tagging_, and then we will _lower case_ all the posts.

**What to implement**: A function `enrich_posts(df)` that will take as input the original dataframe and will return it with **two** additional columns: `enriched_title` and `enriched_selftext`. These columns will contain tokenized, pos-tagged and lower cased versions of the original text. **You must implement them in this order**, because the pos tagger uses casing information.

In [248]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Prasad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Prasad\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [249]:
def tokenization_pos(df):
    li = []
    for text in df:
        if type(text) != float:
            tokens = re.split("/s+", text)
            pos_tags = pos_tag(tokens)
            # print(pos_tags)
            enrich_posts = list(map(lambda x: x[0].lower() ,  pos_tags ))
            li.append(enrich_posts)
        else:
            li.append('No string')
    return li
df['enriched_title'] = tokenization_pos(df['title'])
df['enriched_selftext'] = tokenization_pos(df['selftext'])


In [250]:
df

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,...,title,total_awards_received,upvote_ratio,user_num_posts,user_registered_at,user_upvote_ratio,subr_faved_by_as_list,full_text,enriched_title,enriched_selftext
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,BREAKING: Trump to begin hiding in mailboxes t...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>BREAKING: Trump to begin hiding in mail...,[breaking: trump to begin hiding in mailboxes ...,No string
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,Joe Biden's America,0,0.67,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Joe Biden's America</title>,[joe biden's america],No string
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,4 more years and we can erase his legacy for g...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>4 more years and we can erase his legac...,[4 more years and we can erase his legacy for ...,No string
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,Revelation 9:6 [Transhumanism: The New Religio...,0,1.00,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...",<title>Revelation 9:6 [Transhumanism: The New ...,[revelation 9:6 [transhumanism: the new religi...,No string
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,...,"LOOK HERE, FAT",0,0.88,4661,2012-11-09,-0.658599,"[vergil_never_cry, Jelegend, pianoyeah, salomo...","<title>LOOK HERE, FAT</title>","[look here, fat]",No string
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19935,zqrwiel,2020-07-23 16:39:15,11,246,,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,...,carti why,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>carti why</title>,[carti why],No string
19936,zqrwiel,2020-12-15 11:25:07,39,1,"Then I think we might get 18 songs, outro usua...",2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,...,If uzi on track 3 and 16,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>If uzi on track 3 and 16</title>\n<self...,[if uzi on track 3 and 16],"[then i think we might get 18 songs, outro usu..."
19937,zqrwiel,2020-12-27 13:57:49,15,1,He has 25songs to perform plus the additional ...,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,...,Man carti’s concerts are gonna be long af,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>Man carti’s concerts are gonna be long ...,[man carti’s concerts are gonna be long af],[he has 25songs to perform plus the additional...
19938,zqrwiel,2020-12-29 12:07:10,6,1,I got goose[***]ps just by thinking about it 😬,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,...,Can’t wait to see Carti going full rage mode o...,0,1.00,1883,2014-02-12,0.861626,"[solex125, redreddington22, HibikiSS, klondipe...",<title>Can’t wait to see Carti going full rage...,[can’t wait to see carti going full rage mode ...,[i got goose[***]ps just by thinking about it 😬]


## P1.2 - Answering questions with pandas

In this question, your task is to use pandas to answer questions about the data.

### P1.2.1 - Users with best scores

- Find the users with the highest aggregate scores (over all their posts) for the whole dataset. You should restrict your results to only those whose aggregated score is above 10,000 points, in descending order. Your code should generate a dictionary of the form `{author:aggregated_scores ... }`.

In [253]:
a  =df.groupby('author')['score'].sum()


In [254]:
a[a >10000].to_dict()

{'BlanketMage': 13677,
 'DaFunkJunkie': 250375,
 'Dajakesta0624': 11613,
 'JLBesq1981': 58235,
 'NewAltWhoThis': 12771,
 'NotsoPG': 18518,
 'OldFashionedJizz': 64398,
 'SUPERGUESSOUS': 211611,
 'SonictheManhog': 18116,
 'TheGamerDanYT': 25357,
 'TheJeck': 26058,
 'TrumpSharted': 21154,
 'Wagamaga': 47989,
 'apocalypticalley': 10382,
 'chrisdh79': 143538,
 'hildebrand_rarity': 122464,
 'hilltopye': 81245,
 'iSlingShlong': 118595,
 'jigsawmap': 210824,
 'kevinmrr': 11900,
 'rspix000': 57107,
 'stem12345679': 47455,
 'tefunka': 79560}

In [255]:
def user_best_score(df):
    # your code here
    a  =df.groupby('author')['score'].sum()
    return a[a >10000].to_dict()

dict = user_best_score(df)



In [256]:
dict

{'BlanketMage': 13677,
 'DaFunkJunkie': 250375,
 'Dajakesta0624': 11613,
 'JLBesq1981': 58235,
 'NewAltWhoThis': 12771,
 'NotsoPG': 18518,
 'OldFashionedJizz': 64398,
 'SUPERGUESSOUS': 211611,
 'SonictheManhog': 18116,
 'TheGamerDanYT': 25357,
 'TheJeck': 26058,
 'TrumpSharted': 21154,
 'Wagamaga': 47989,
 'apocalypticalley': 10382,
 'chrisdh79': 143538,
 'hildebrand_rarity': 122464,
 'hilltopye': 81245,
 'iSlingShlong': 118595,
 'jigsawmap': 210824,
 'kevinmrr': 11900,
 'rspix000': 57107,
 'stem12345679': 47455,
 'tefunka': 79560}

### P1.2.2 - Awarded posts

Find the number of posts that have received at least one award. Your query should return only one value.

In [258]:
l =df[df['total_awards_received'] >= 1] #.max()['user_num_posts']

In [259]:
l.shape[0]

119

In [260]:
k = df.loc[df['total_awards_received'] >= 1]['user_num_posts']

In [261]:
print(k)
print(k.sum())

67       1105
69       1105
818      8292
866      6102
911      6900
         ... 
17041    2476
17361    4838
18599    3862
18823    4604
19639     355
Name: user_num_posts, Length: 119, dtype: int64
653664


In [262]:
# your code here
def Awarded_posts(df):
    # your code here
    l =df[df['total_awards_received'] >= 1]
    
    
    return l.shape[0]

number_of_posts = Awarded_posts(df)
number_of_posts

119

In [303]:
df.columns

Index(['author', 'posted_at', 'num_comments', 'score', 'selftext',
       'subr_created_at', 'subr_description', 'subr_faved_by',
       'subr_numb_members', 'subr_numb_posts', 'subreddit', 'title',
       'total_awards_received', 'upvote_ratio', 'user_num_posts',
       'user_registered_at', 'user_upvote_ratio', 'subr_faved_by_as_list',
       'full_text', 'enriched_title', 'enriched_selftext'],
      dtype='object')

In [321]:

[:,0:15]

Unnamed: 0,author,posted_at,num_comments,score,selftext,subr_created_at,subr_description,subr_faved_by,subr_numb_members,subr_numb_posts,subreddit,title,total_awards_received,upvote_ratio,user_num_posts
0,-Howitzer-,2020-08-17 20:26:04,19,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,BREAKING: Trump to begin hiding in mailboxes t...,0,1.00,4661
1,-Howitzer-,2020-07-06 17:01:48,1,3,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Joe Biden's America,0,0.67,4661
2,-Howitzer-,2020-09-09 02:29:02,3,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,4 more years and we can erase his legacy for g...,0,1.00,4661
3,-Howitzer-,2020-06-23 23:02:39,2,1,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,Revelation 9:6 [Transhumanism: The New Religio...,0,1.00,4661
4,-Howitzer-,2020-08-07 04:13:53,32,622,,2009-04-29,Subreddit about Donald Trump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ...",30053,796986,donaldtrump,"LOOK HERE, FAT",0,0.88,4661
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19935,zqrwiel,2020-07-23 16:39:15,11,246,,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,carti why,0,1.00,1883
19936,zqrwiel,2020-12-15 11:25:07,39,1,"Then I think we might get 18 songs, outro usua...",2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,If uzi on track 3 and 16,0,1.00,1883
19937,zqrwiel,2020-12-27 13:57:49,15,1,He has 25songs to perform plus the additional ...,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Man carti’s concerts are gonna be long af,0,1.00,1883
19938,zqrwiel,2020-12-29 12:07:10,6,1,I got goose[***]ps just by thinking about it 😬,2009-04-13,A subreddit dedicated to the discussion of hip...,"['solex125', 'redreddington22', 'HibikiSS', 'k...",8740,630857,playboicarti,Can’t wait to see Carti going full rage mode o...,0,1.00,1883


### P1.2.3 Find Covid

Find the name and description of all subreddits where the name starts with `Covid` or `Corona` and the description contains `covid` or `Covid` anywhere. Your code should generate a dictionary of the form#

```python
  {'Coronavirus':'Place to discuss all things COVID-related',
  ...
  }
```

In [384]:
# your code here
o = df[df['title'].str.startswith('Covid')| df['title'].str.startswith('Corona')] 
m = o[o['subr_description'].str.contains('Covid') | o['subr_description'].str.contains('covid')]
m[['title','subr_description']]

Unnamed: 0,title,subr_description
700,Coronavirus Tips For Working At Home: Your Wor...,Tracking the Coronavirus/Covid-19 outbreak in ...
5565,Coronavirus update: Fresno death toll sets new...,Tracking the Coronavirus/Covid-19 outbreak in ...
5568,Coronavirus update: Thousands more cases repor...,Tracking the Coronavirus/Covid-19 outbreak in ...
7727,Covid-19 ITALY Disinfecting improvisation Wuha...,Tracking the Coronavirus/Covid-19 outbreak in ...
9640,Coronavirus Update: Newsom Announces Mortgage ...,Tracking the Coronavirus/Covid-19 outbreak in ...
9642,Coronavirus fallout: 17 California metros at s...,Tracking the Coronavirus/Covid-19 outbreak in ...
9646,Coronavirus: California’s patients in ICUs tri...,Tracking the Coronavirus/Covid-19 outbreak in ...


In [397]:
q = pd.Series(m['subr_description'].values, index=m['title']).to_dict()

In [399]:
q

{'Coronavirus Tips For Working At Home: Your Work Space': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Coronavirus update: Fresno death toll sets new record as city weighs shutdown order': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Coronavirus update: Thousands more cases reported in Fresno ahead of death toll update': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Covid-19 ITALY Disinfecting improvisation Wuhan Coronavirus': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Coronavirus Update: Newsom Announces Mortgage Payment Relief, Increased Unemployment Funds': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Coronavirus fallout: 17 California metros at severe economic risk': 'Tracking the Coronavirus/Covid-19 outbreak in California',
 'Coronavirus: California’s patients in ICUs triple, hospitalizations double. 1,432 COVID-19 patients in hospitals, 597 in ICUs across the state': 'Tracking the Coronavirus/C

### P1.2.4 - Redditors that favorite the most

Find the users that have favorited the largest number of subreddits. You must produce a pandas dataframe with **two** columns, with the following format:

```python
     redditor	    numb_favs
0	user1           7
1	user2           6
2	user3	       5
3	user4           4
...
```

where the first column is a Redditor username and the second column is the number of distinct subreddits he/she has favorited.

In [412]:
# your code here

w = df[['subreddit','subr_faved_by']]
w


Unnamed: 0,subreddit,subr_faved_by
0,donaldtrump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ..."
1,donaldtrump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ..."
2,donaldtrump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ..."
3,donaldtrump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ..."
4,donaldtrump,"['vergil_never_cry', 'Jelegend', 'pianoyeah', ..."
...,...,...
19935,playboicarti,"['solex125', 'redreddington22', 'HibikiSS', 'k..."
19936,playboicarti,"['solex125', 'redreddington22', 'HibikiSS', 'k..."
19937,playboicarti,"['solex125', 'redreddington22', 'HibikiSS', 'k..."
19938,playboicarti,"['solex125', 'redreddington22', 'HibikiSS', 'k..."


In [424]:
w.groupby('subreddit')['subr_faved_by'].count()

subreddit
40kLore          196
AMD_Stock        141
Anki              30
ApexOutlands      58
BanGDream        170
                ... 
touhou           798
virginvschad      57
wicked_edge      126
worldbuilding    148
xqcow            309
Name: subr_faved_by, Length: 63, dtype: int64

## P1.3 Ethics

Imagine you use the insights gained in this assignment to scan social media for covid-related content, and automatically flag it as `conspiracy` or `not conspiracy` (for example, for hiding potentially harmful tweets or facebook posts). Reflect on the impact of exploiting data science for such an application.


Your answer should address the following:
- Identify the action that, in your opinion, is the weakest. 
- Then, justify your choice by critically analyzing the three key principles outlined in the Framework, namely transparency, accountability and fairness. 
- Finally, you should propose one solution that explicitly addresses one point related to one of these three principles, reflecting on how your solution would improve the data cycle in this particular use case. 

Your answer should be between 500 and 700 words. 

---

Your answer here