# Transformation of scraped data

### Contents

1- Loading data

2- Removing duplicates

3- Joining comments with posts

4- Adding a sentiment analysis feature

5- Exporting the dataframe with all comments

6- Computing per-post metrics (aggregating comment information)

In [50]:
import glob
import pandas as pd
import warnings
import re
import spacy
from spacy.lang.fr.stop_words import STOP_WORDS
import datetime

from textblob import Blobber
from textblob_fr import PatternTagger, PatternAnalyzer

# Parameters

In [51]:
subreddit = 'france'
update = True

# Loading data

## Comments & posts

In [52]:
warnings.simplefilter(action='ignore', category=FutureWarning)

# Folder holding the scraped comment CSV files
path = "../scrapping/exports/" + subreddit + "/comments"

# List every .csv file in the folder ("/" is a valid separator on Windows too)
filenames = glob.glob(path + "/*.csv")
frames = []
for file in filenames:
    print("\nReading file = ", file)
    frames.append(pd.read_csv(file))

# Concatenate all files into one dataframe with a fresh 0..n-1 index
all_comments = pd.concat(frames, ignore_index=True)


Reading file =  ../scrapping/exports/france/comments\france_comments_x67dzf.csv

Reading file =  ../scrapping/exports/france/comments\france_comments_x6ppsd.csv

Reading file =  ../scrapping/exports/france/comments\france_comments_xci5m1.csv

Reading file =  ../scrapping/exports/france/comments\france_comments_xhtzdy.csv

Reading file =  ../scrapping/exports/france/comments\france_comments_yg3dj6.csv


In [53]:
# Folder holding the scraped post CSV files
path = "../scrapping/exports/" + subreddit + "/posts"

# List every .csv file in the folder ("/" is a valid separator on Windows too)
filenames = glob.glob(path + "/*.csv")
print('File names:', filenames)
frames = []
for file in filenames:
    print("\nReading file = ", file)
    frames.append(pd.read_csv(file))

# Concatenate all files into one dataframe with a fresh 0..n-1 index
all_titres = pd.concat(frames, ignore_index=True)

File names: ['../scrapping/exports/france/posts\\france_20220901_20221030.csv']

Reading file =  ../scrapping/exports/france/posts\france_20220901_20221030.csv
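
The two loading cells follow the same pattern, so they could be folded into one helper. This is a minimal sketch, assuming pandas is installed; `load_csv_folder` is a hypothetical name that does not appear in the notebook. It builds the result with `pd.concat`, since `DataFrame.append` is no longer available in pandas 2.x.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(folder: str) -> pd.DataFrame:
    """Read every .csv file in `folder` into one dataframe (hypothetical helper)."""
    files = sorted(Path(folder).glob("*.csv"))
    if not files:
        # No files found: return an empty dataframe instead of failing
        return pd.DataFrame()
    frames = [pd.read_csv(f) for f in files]
    # ignore_index=True rebuilds a clean 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)
```

With such a helper, the comments could be loaded with `load_csv_folder("../scrapping/exports/" + subreddit + "/comments")` and the posts with the matching `/posts` folder.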


## Previous transformed data

In [54]:
previous_df = pd.read_parquet('exports/' + subreddit + '/' + subreddit + '_merged.parquet', engine='pyarrow')

# Removing duplicates

In [55]:
all_comments = all_comments.drop_duplicates()
all_comments.rename(columns = {'commentId':'comment_id', 'parent_commentId':'parent_comment_id'}, inplace = True)

all_titres = all_titres.drop_duplicates()
all_titres.rename(columns = {'postId':'post_id'}, inplace = True)

Check that no duplicated IDs remain in either dataframe.

In [56]:
assert not all_comments['comment_id'].duplicated().any(), "Found duplicated comment IDs in the dataframe"
assert not all_titres['post_id'].duplicated().any(), "Found duplicated post IDs in the dataframe"
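
Called without arguments, `drop_duplicates()` only drops rows that are identical in every column, so two rows can still share an ID while differing elsewhere. That is why the ID check is worth running. A small sketch on invented rows:

```python
import pandas as pd

# Invented rows: two "a2" rows differ in text, two "a3" rows are identical
comments = pd.DataFrame({
    "comment_id": ["a1", "a2", "a2", "a3", "a3"],
    "text": ["salut", "bonjour", "bonjour !", "merci", "merci"],
})

# Removes only the exact duplicate ("a3", "merci")
deduped = comments.drop_duplicates()

# The second "a2" row is still flagged as a duplicated ID
still_duplicated = deduped[deduped.duplicated(["comment_id"])]
print(still_duplicated)

# Deduplicating on the key itself keeps the first row seen for each ID
by_id = deduped.drop_duplicates(subset=["comment_id"], keep="first")
assert not by_id["comment_id"].duplicated().any()
```

If the assertions in the notebook ever fail, deduplicating on the ID column as in the last step is one possible fix, at the cost of silently choosing which version of a row to keep.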

# Joining comments & posts

In [57]:
merged_df = all_comments.merge(all_titres, on="post_id", how="left")
# Give the overlapping columns explicit comment/post names
merged_df = merged_df.rename(columns={
    'authors': 'author_comment',
    'author': 'author_post',
    'created_y': 'created_post',
    'permalink_x': 'permalink_comment',
    'permalink_y': 'permalink_post',
})
merged_df

| | post_id | author_comment | comment_id | text | parent_comment_id | created_x | permalink_comment | title | body | url | author_post | created_post | permalink_post | flair |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x5yqb5 | Camulogene | in3xqt1 | Ça doit être le retour des transhumances | t3_x5yqb5 | 2022-09-05 00:00:33 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | RANT : Mes troupeaux de motard sur les routes ... | Je suis en randonnée dans le Jura en ce moment... | https://www.reddit.com/r/france/comments/x5yqb... | anyatrans | 2022-09-04 23:47:20 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | Forum Libre |
| 1 | x5yqb5 | RobotSpaceBear | in3z4bu | "Des gens que je ne connais pas aiment quelque... | t3_x5yqb5 | 2022-09-05 00:10:48 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | RANT : Mes troupeaux de motard sur les routes ... | Je suis en randonnée dans le Jura en ce moment... | https://www.reddit.com/r/france/comments/x5yqb... | anyatrans | 2022-09-04 23:47:20 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | Forum Libre |
| 2 | x5yqb5 | la_mine_de_plomb | in3zclt | 😕 | t3_x5yqb5 | 2022-09-05 00:12:31 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | RANT : Mes troupeaux de motard sur les routes ... | Je suis en randonnée dans le Jura en ce moment... | https://www.reddit.com/r/france/comments/x5yqb... | anyatrans | 2022-09-04 23:47:20 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | Forum Libre |
| 3 | x5yqb5 | anyatrans | in40qav | Ah non on peut aimer ce qu'on veut a condition... | t1_in3z4bu | 2022-09-05 00:22:59 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | RANT : Mes troupeaux de motard sur les routes ... | Je suis en randonnée dans le Jura en ce moment... | https://www.reddit.com/r/france/comments/x5yqb... | anyatrans | 2022-09-04 23:47:20 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | Forum Libre |
| 4 | x5yqb5 | quatruplesec | in43n07 | ben t'es pas assez loin des routes de passage | t3_x5yqb5 | 2022-09-05 00:45:39 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | RANT : Mes troupeaux de motard sur les routes ... | Je suis en randonnée dans le Jura en ce moment... | https://www.reddit.com/r/france/comments/x5yqb... | anyatrans | 2022-09-04 23:47:20 | /r/france/comments/x5yqb5/rant_mes_troupeaux_d... | Forum Libre |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 204235 | yg3dj6 | pousse_tes_fesses | iu81r1y | r/france qui rage quand une image leur rappell... | t3_yg3dj6 | 2022-10-29 09:39:55 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | besoin d'aide dans les Landes | | https://i.redd.it/yz0l1eq36ow91.jpg | Maxipmz | 2022-10-29 01:06:42 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | Actus |
| 204236 | yg3dj6 | Maxipmz | iu86ukc | Je suis désolé mais moi j'arrive à lire | t3_yg3dj6 | 2022-10-29 10:56:32 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | besoin d'aide dans les Landes | | https://i.redd.it/yz0l1eq36ow91.jpg | Maxipmz | 2022-10-29 01:06:42 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | Actus |
| 204237 | yg3dj6 | Maxipmz | iu86xeq | MERCI | t1_iu7vq9b | 2022-10-29 10:57:43 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | besoin d'aide dans les Landes | | https://i.redd.it/yz0l1eq36ow91.jpg | Maxipmz | 2022-10-29 01:06:42 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | Actus |
| 204238 | yg3dj6 | Maxipmz | iu86yb6 | Exact | t1_iu81r1y | 2022-10-29 10:58:05 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | besoin d'aide dans les Landes | | https://i.redd.it/yz0l1eq36ow91.jpg | Maxipmz | 2022-10-29 01:06:42 | /r/france/comments/yg3dj6/besoin_daide_dans_le... | Actus |
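
The `_x` and `_y` suffixes in the column names come from `merge`: a column that exists in both dataframes and is not the join key gets `_x` on the left side and `_y` on the right. A minimal sketch on invented data, which also passes `validate="many_to_one"` so that pandas raises an error if a `post_id` appears more than once on the posts side:

```python
import pandas as pd

# Invented miniature versions of the comments and posts dataframes
comments = pd.DataFrame({
    "post_id": ["p1", "p1", "p2"],
    "comment_id": ["c1", "c2", "c3"],
    "created": ["2022-09-05", "2022-09-05", "2022-09-06"],
})
posts = pd.DataFrame({
    "post_id": ["p1"],
    "title": ["Un titre"],
    "created": ["2022-09-04"],
})

merged = comments.merge(posts, on="post_id", how="left", validate="many_to_one")
print(merged.columns.tolist())
# ['post_id', 'comment_id', 'created_x', 'title', 'created_y']

# The left join keeps comment c3 even though post p2 is missing; its title is NaN
print(merged["title"].isna().tolist())
# [False, False, True]
```

A left join therefore never loses comments, but comments whose post was not scraped end up with empty post columns.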


# Adding a sentiment analysis feature

### Building an NLP pipeline to clean the comments

In [58]:

def nlp_pipeline(comment) -> str:
    """Lowercase a comment and strip line breaks, numbers, and punctuation."""
    comment = str(comment).lower()
    # Flatten line breaks and collapse runs of whitespace
    comment = comment.replace('\n', ' ').replace('\r', '')
    comment = ' '.join(comment.split())
    # Drop tokens containing digits (e.g. "22h30", "30°", "10%")
    comment = re.sub(r"[A-Za-z\.]*[0-9]+[A-Za-z%°\.]*", "", comment)
    # Drop free-standing hyphens (with their surrounding spaces) and a trailing hyphen
    comment = re.sub(r"(\s\-\s|-$)", "", comment)
    # Drop common punctuation
    comment = re.sub(r"[,\!\?\%\(\)\/\"]", "", comment)
    # Drop HTML entities such as "&gt;" together with the whitespace that follows
    comment = re.sub(r"\&\S*\s", "", comment)
    # Drop the remaining symbols
    comment = re.sub(r"[&+#$£%:@-]", "", comment)

    return comment
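
To see what the two least obvious patterns do, they can be run on their own. The sketch below copies the digit-token and HTML-entity patterns from `nlp_pipeline`; the sample strings are invented.

```python
import re

# Patterns taken from nlp_pipeline; the sample strings are invented
DIGIT_TOKEN = r"[A-Za-z\.]*[0-9]+[A-Za-z%°\.]*"
HTML_ENTITY = r"\&\S*\s"

no_digits = re.sub(DIGIT_TOKEN, "", "il fait 30° à 22h30 chez moi")
print(repr(no_digits))   # 'il fait  à  chez moi' (the spaces around each gap remain)

no_entity = re.sub(HTML_ENTITY, "", "bien &gt; mal")
print(repr(no_entity))   # 'bien mal'

# Collapsing whitespace again tidies the leftover double spaces
print(" ".join(no_digits.split()))   # il fait à chez moi
```

Because the pipeline collapses whitespace before these substitutions, its output can contain double spaces. The stop-word step that follows splits and re-joins each comment, which removes them.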

In [59]:
# Remove negation words from spaCy's stop-word list so that they stay in the comments
deselect_stop_words = ['ne','pas','plus','personne','aucun','ni','aucune','rien']
stop_words = set(STOP_WORDS) - set(deselect_stop_words)

### Applying the pipeline and removing stop words from the dataframe

In [60]:
merged_df['text_processed'] = merged_df['text'].apply(nlp_pipeline)
merged_df['text_processed'] = merged_df['text_processed'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
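
The negation words are presumably kept because they can flip the meaning of a sentence, which matters for sentiment scoring. The sketch below shows the filtering on one sentence; it uses a tiny invented stop-word set in place of spaCy's `STOP_WORDS`, and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny invented stop-word set standing in for spaCy's STOP_WORDS
stop_words = {"je", "le", "la", "de", "ne", "pas", "est", "ce"}

# Same idea as the notebook: keep negations by taking them out of the set
negations = ["ne", "pas", "plus", "rien"]
stop_words -= set(negations)


def remove_stop_words(text: str) -> str:
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


print(remove_stop_words("je ne le fais pas"))   # ne fais pas
```

Without the deselection step, the same sentence would be reduced to `fais`, losing the negation entirely.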

### Using the TextBlob library to compute sentiment

In [61]:
# TextBlob configured with the French tagger and analyzer from textblob_fr
tb = Blobber(pos_tagger=PatternTagger(), analyzer=PatternAnalyzer())

senti_num_list = []
senti_cat_list = []
for text in merged_df["text_processed"]:
    # The first element of the sentiment tuple is the polarity score
    vs = tb(text).sentiment[0]
    senti_num_list.append(vs)
    # Scores within 0.08 of zero are treated as neutral
    if vs > 0.08:
        senti_cat_list.append('Positive')
    elif vs < -0.08:
        senti_cat_list.append('Negative')
    else:
        senti_cat_list.append('Neutral')

merged_df['sentiment_num'] = senti_num_list
merged_df['sentiment_cat'] = senti_cat_list
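
Once the scores are in a column, the three-way labelling can also be written without a Python loop. This is a sketch of that alternative, assuming numpy and pandas are available; the scores are invented and include both boundary values to show that exactly ±0.08 stays neutral, as in the loop.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.08  # same cut-off as the loop in the notebook

# Invented polarity scores, including the two boundary values
scores = pd.Series([0.29, -0.5, 0.0, 0.08, -0.08])

# The first matching condition wins; anything else falls back to the default
categories = np.select(
    [(scores > THRESHOLD).to_numpy(), (scores < -THRESHOLD).to_numpy()],
    ["Positive", "Negative"],
    default="Neutral",
)
print(categories.tolist())
# ['Positive', 'Negative', 'Neutral', 'Neutral', 'Neutral']
```

On a dataframe of roughly 200,000 comments the scoring itself is likely to dominate the runtime, so this mainly shortens the code rather than speeding it up.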

In [62]:
print('## Original comment ##')
print(merged_df['text'][100])
print('## Score comment ##')
print(merged_df['sentiment_num'][100])

## Original comment ##
La grosse différence c'est que ceux qui empêchent les autres de profiter de leur dimanche... ce sont les motards justement.

Perso, j'adore faire de la musique. Avec des amplis à lampe. Fort. Et plein de distorsion. Et pas nécessairement des choses mélodieuses. J'aime la noise. Je pourrais légalement faire ce que je veux chez moi jusqu'à 22h30 et faire des concertos de larsen tous les jours de la semaine si je le souhaitais.
Sauf que...
Je ne le fais pas. Je respecte mes voisins et leur besoin de tranquillité. Et j'attends d'eux qu'ils fassent de même en échange.

Et quand je veux faire du bruit ? Je vais dans un lieu dédié à ça tout simplement (studio de répéte par exemple).
## Score comment ##
0.29


In [63]:
def save_transform(df: pd.DataFrame) -> None:
    """Export the dataframe as CSV and Parquet under exports/<subreddit>/."""
    file_name = subreddit + '_merged'
    # Build the path from the subreddit parameter rather than a fixed folder
    base_path = 'exports/' + subreddit + '/' + file_name
    path_csv = base_path + '.csv'
    df.to_csv(path_csv, index=False, encoding='utf-8')
    print('Saved csv df in : ' + path_csv)
    path_parquet = base_path + '.parquet'
    df.to_parquet(path_parquet, index=False, engine='pyarrow')
    print('Saved parquet df in : ' + path_parquet)

save_transform(merged_df)

Saved csv df in : exports/france/france_merged.csv
Saved parquet df in : exports/france/france_merged.parquet
