## Data Cleaning Plan

We have three data sets:
- Markers' bios and metadata (markers_bios)
- Followers' bios and metadata (followers_bios)
- All brands and their followers (markers-followers)


Step-by-step plan:
1. Load the follower bios and the marker-follower file.
    - Provide summary statistics of users and brands. How many brands do we have? How many followers? Is there any missing or duplicated data?

2. Filter the marker-follower df:
    - Create a dictionary of brand counts per follower
    - Remove users that follow fewer than 5 brands (the threshold can be raised)
    - Continuously track the number of users removed
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only include users that follow at least five brands. Subsequent filters are then applied to the correct, up-to-date set of follower bios.

3. Filter the follower bios:
    - Remove users with fewer than 25 followers
    - Remove users with fewer than 100 tweets

4. Filter based on language: keep only French accounts









In [51]:
# Standard library imports
import os
import re
import csv
import sys
import html
import pickle
import importlib
from datetime import datetime
from collections import defaultdict

# Third-party library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import psutil
import emoji
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unidecode import unidecode

# Local application/library specific imports
import utils2
from utils2 import *  # wildcard import; later cells call some helpers unqualified

## 1. Load files and summary stats

In [22]:
# Load the data files and rename ID columns
importlib.reload(utils2)

# Load markers-followers
load_path = '/home/livtollanes/SocialMarkers'
file = 'markers_followers_2023-05-19.csv'

req_cols = ['id', 'follower_id']
dtypes = {'id': 'object',
          'follower_id': 'object'}

markers_followers = utils2.fileloader(load_path, file, req_cols, dtypes)


# Rename the id column to marker_id
markers_followers.rename(columns={'id':'marker_id'}, inplace=True)
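
The source of `utils2` is not shown in this notebook, so the following is only a guess at what `fileloader` might do: read a subset of columns with fixed dtypes. The name `fileloader_sketch` and the toy file are hypothetical. The second read illustrates why the ID columns are loaded as `object`: when a numeric-looking ID column contains a missing value, pandas infers `float64` and long IDs lose digits.

```python
import os
import tempfile

import pandas as pd


def fileloader_sketch(load_path, file, req_cols, dtypes):
    """Hypothetical stand-in for utils2.fileloader (an assumption, not the real helper)."""
    full_path = os.path.join(load_path, file)
    # Read only the requested columns and force the given dtypes.
    return pd.read_csv(full_path, usecols=req_cols, dtype=dtypes)


with tempfile.TemporaryDirectory() as tmp:
    # Toy file: one long ID and one missing ID.
    with open(os.path.join(tmp, "toy.csv"), "w", encoding="utf-8") as fh:
        fh.write("id,follower_id,extra\n12,1412345621185400846,x\n13,,y\n")

    # IDs kept as text: every digit survives.
    as_text = fileloader_sketch(
        tmp, "toy.csv", ["id", "follower_id"],
        {"id": "object", "follower_id": "object"},
    )
    # IDs left to dtype inference: the missing value forces float64.
    as_inferred = pd.read_csv(os.path.join(tmp, "toy.csv"), usecols=["id", "follower_id"])

print(as_text["follower_id"].iloc[0])
print(as_inferred["follower_id"].dtype)
```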

In [7]:
# Load the followers bios and rename ID columns
load_path = '/home/livtollanes/SocialMarkers'
file = 'markers_followers_bios_2023-05-19.csv'

req_cols = ['twitter_id', 'id', 'screen_name', 'description', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists','timestamp_utc']

dtypes = {
    'twitter_id': 'object',
    'id': 'object',
    'screen_name': 'object',
    'description': 'object',
    'location': 'object',
    'tweets': 'float64',
    'followers': 'float64',
    'friends': 'float64',
    'witheld_in_countries': 'float64'
}

followers_bios = utils2.fileloader(load_path, file, req_cols, dtypes)



# Rename the twitter_id column to follower_id
followers_bios.rename(columns={'twitter_id':'follower_id'}, inplace=True)

Summary statistics

In [8]:
importlib.reload(utils2)
utils2.summary_stats(followers_bios, print_dtypes=False)

Shape of DataFrame:  (70666646, 11)

Columns in DataFrame:  ['follower_id', 'id', 'screen_name', 'description', 'timestamp_utc', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists']

Number of unique values in 'follower_id':  70666646

Number of duplicate values in 'follower_id':  0

Number of unique values in 'id':  70642661

Number of duplicate values in 'id':  23984

Number of missing values in each column:
 follower_id             0
id                  23985
screen_name         23986
description      42027215
timestamp_utc       23985
location         47956041
tweets              23985
followers           23985
friends             23985
likes               23985
lists               23985
dtype: int64
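
A minimal sketch of how such a summary could be computed, assuming `summary_stats` relies on `nunique`, `duplicated` and `isna` (the real helper is not shown, and `summary_stats_sketch` is a hypothetical name). Under that assumption, `nunique` ignores missing values while `duplicated` flags every repeated missing value after the first, which would be consistent with the 23984 "duplicates" in `id` being the 23985 missing values minus their first occurrence.

```python
import pandas as pd


def summary_stats_sketch(df, id_cols):
    """Hypothetical summary: shape, missing values, and unique/duplicate counts per ID column."""
    stats = {"shape": df.shape, "missing": df.isna().sum().to_dict()}
    for col in id_cols:
        stats[col] = {
            "unique": int(df[col].nunique()),              # NaN is not counted
            "duplicates": int(df[col].duplicated().sum()),  # repeated NaN is counted
        }
    return stats


toy = pd.DataFrame({
    "follower_id": ["a", "b", "c", "d"],
    "id": ["1", "2", None, None],
})
stats = summary_stats_sketch(toy, ["follower_id", "id"])
print(stats)
```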


In [27]:
importlib.reload(utils2)
utils2.summary_stats(markers_followers, print_dtypes=False)

Shape of DataFrame:  (126345412, 2)

Columns in DataFrame:  ['marker_id', 'follower_id']

Number of unique values in 'follower_id':  70636295

Number of duplicate values in 'follower_id':  55709117

Number of unique values in 'marker_id':  236

Number of duplicate values in 'marker_id':  126345176

Number of missing values in each column:
 marker_id      0
follower_id    0
dtype: int64


In [29]:
compare_column_values(followers_bios, markers_followers, 'follower_id')

There are 30351 unique values in df1 that don't exist in df2.
There are 0 unique values in df2 that don't exist in df1.
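
The comparison above can be sketched with plain set differences. This is an assumption about how `compare_column_values` works, not its actual code; `compare_column_values_sketch` and the toy frames are hypothetical.

```python
import pandas as pd


def compare_column_values_sketch(df1, df2, column):
    """Hypothetical check: which values of `column` appear in only one of the two frames."""
    values1 = set(df1[column].dropna().unique())
    values2 = set(df2[column].dropna().unique())
    only_in_df1 = values1 - values2
    only_in_df2 = values2 - values1
    print(f"There are {len(only_in_df1)} unique values in df1 that don't exist in df2.")
    print(f"There are {len(only_in_df2)} unique values in df2 that don't exist in df1.")
    return only_in_df1, only_in_df2


bios = pd.DataFrame({"follower_id": ["1", "2", "3"]})
edges = pd.DataFrame({"marker_id": ["m1", "m2", "m1"], "follower_id": ["1", "1", "2"]})
only_bios, only_edges = compare_column_values_sketch(bios, edges, "follower_id")
```

In the real data, the 30351 IDs found only in the bios are exactly the difference between the 70666646 unique bio IDs and the 70636295 unique follower IDs in the marker-follower file.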


## 2. Filter the marker-follower df

- Filter the marker-follower df:
    - Remove users that follow fewer than 5 brands (the threshold can be raised)

    - Continuously track the number of users removed
    
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only include users that follow at least five brands. Subsequent filters are then applied to the correct, up-to-date set of follower bios.

Remove users that follow fewer than 5 brands

In [31]:
n = 5  # minimal number of brands followed required to be included in the analysis
markers_followers_5 = utils2.filter_followers(markers_followers, 'follower_id', n)

66606560 followers follow less than 5 brands (94.30% of the total followers).
After removing these followers, 4029735 followers are left (5.70% of the followers in the inputted df).
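
A minimal sketch of this filter, assuming each row of the marker-follower df is one (brand, follower) pair and that `filter_followers` keeps followers with at least `n` rows. The helper name `filter_followers_sketch` and the toy data are hypothetical; the real `utils2.filter_followers` may differ.

```python
import pandas as pd


def filter_followers_sketch(df, follower_col, min_brands):
    """Hypothetical filter: keep followers that appear in at least `min_brands` rows."""
    # Rows per follower, i.e. number of brands followed (assumes no duplicate pairs).
    brand_counts = df[follower_col].value_counts()
    keep_ids = brand_counts[brand_counts >= min_brands].index
    n_total = len(brand_counts)
    n_removed = n_total - len(keep_ids)
    share = n_removed / max(n_total, 1)
    print(f"{n_removed} followers follow less than {min_brands} brands ({share:.2%}).")
    return df[df[follower_col].isin(keep_ids)]


toy = pd.DataFrame({
    "marker_id":   ["m1", "m2", "m3", "m1", "m1", "m2"],
    "follower_id": ["a",  "a",  "a",  "b",  "c",  "c"],
})
filtered = filter_followers_sketch(toy, "follower_id", 2)
```

In the run above, the two reported counts add up to the number of unique followers in the input: 66606560 + 4029735 = 70636295.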


Match the IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only cover users who follow at least 5 brands

In [32]:
followers_bios_5 = utils2.streamline_IDs(markers_followers_5, followers_bios, 'follower_id')

Number of unique follower_id in source: 4029735
Number of unique follower_id in df_tofilter after filtering: 4029735
Removed 66636911 rows.
4029735 rows are left.
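
The ID matching can be sketched as an `isin` filter. This assumes `streamline_IDs` keeps the rows of `df_tofilter` whose ID occurs in `source`; the function `streamline_ids_sketch` and the toy frames are hypothetical.

```python
import pandas as pd


def streamline_ids_sketch(source, df_tofilter, column):
    """Hypothetical matcher: keep rows of `df_tofilter` whose `column` value occurs in `source`."""
    keep = df_tofilter[column].isin(source[column].unique())
    filtered = df_tofilter[keep]
    print(f"Removed {len(df_tofilter) - len(filtered)} rows.")
    print(f"{len(filtered)} rows are left.")
    return filtered


edges = pd.DataFrame({"marker_id": ["m1", "m2"], "follower_id": ["a", "a"]})
bios = pd.DataFrame({"follower_id": ["a", "b", "c"], "tweets": [120.0, 5.0, 300.0]})
bios_matched = streamline_ids_sketch(edges, bios, "follower_id")
```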


In [33]:
compare_column_values(followers_bios_5, markers_followers_5, 'follower_id')   

There are 0 unique values in df1 that don't exist in df2.
There are 0 unique values in df2 that don't exist in df1.


## 3. Filter the follower bios
- Remove users with fewer than 25 followers
- Remove users with fewer than 100 tweets
- Update the markers-followers df to match the now filtered bio df


In [35]:

followers_bios_fullfilter = utils2.filter_by_tweets_and_followers(followers_bios_5, min_followers= 25, min_tweets= 100)


Removed 2789232 rows.
1240503 rows are left.
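
A minimal sketch of the activity filter, assuming both thresholds are inclusive lower bounds and that rows with missing counts are dropped (comparisons against NaN evaluate to False). `filter_by_tweets_and_followers_sketch` is a hypothetical name; the real helper may treat missing values differently.

```python
import pandas as pd


def filter_by_tweets_and_followers_sketch(df, min_followers, min_tweets):
    """Hypothetical filter: keep rows meeting both minimum counts."""
    mask = (df["followers"] >= min_followers) & (df["tweets"] >= min_tweets)
    filtered = df[mask]
    print(f"Removed {len(df) - len(filtered)} rows.")
    print(f"{len(filtered)} rows are left.")
    return filtered


toy = pd.DataFrame({
    "follower_id": ["a", "b", "c", "d"],
    "followers":   [25.0, 24.0, 1000.0, float("nan")],
    "tweets":      [100.0, 500.0, 99.0, 200.0],
})
active = filter_by_tweets_and_followers_sketch(toy, min_followers=25, min_tweets=100)
```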


Again, remove the follower IDs in markers-followers that don't occur in the newly filtered followers_bios_fullfilter

In [36]:
markers_followers_fullfilter = utils2.streamline_IDs(source= followers_bios_fullfilter, df_tofilter=markers_followers_5, column='follower_id')

Number of unique follower_id in source: 1240503
Number of unique follower_id in df_tofilter after filtering: 1240503
Removed 19251355 rows.
9970120 rows are left.


In [37]:
compare_column_values(followers_bios_fullfilter, markers_followers_fullfilter , 'follower_id')

There are 0 unique values in df1 that don't exist in df2.
There are 0 unique values in df2 that don't exist in df1.


In [63]:
# Before writing to CSV, clean the description column to avoid writing problems
importlib.reload(utils2)
followers_bios_fullfilter = utils2.process_description(followers_bios_fullfilter, 'description')
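
The body of `process_description` is not shown. As a rough standard-library approximation of the cleaning step, the sketch below decomposes accented characters and drops everything that has no ASCII form, including emoji. This is a swapped-in technique: the notebook imports `emoji` and `unidecode`, and its output maps some symbols to ASCII look-alikes (for example `•` to `*`), which this sketch does not do. The names `to_plain_ascii` and `process_description_sketch` are hypothetical.

```python
import unicodedata

import pandas as pd


def to_plain_ascii(text):
    """Strip accents and drop non-ASCII characters; leave missing values untouched."""
    if not isinstance(text, str):
        return text
    decomposed = unicodedata.normalize("NFKD", text)  # "é" -> "e" + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")


def process_description_sketch(df, column):
    """Hypothetical cleaner: add a `<column>_cleantext` column next to the original."""
    out = df.copy()
    out[column + "_cleantext"] = out[column].map(to_plain_ascii)
    return out


toy = pd.DataFrame({"description": ["Quedé 🤡", "||🍣||", None]})
cleaned = process_description_sketch(toy, "description")
print(cleaned)
```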

In [66]:
# Inspect rows at positions 60 to 79
pd.set_option('display.max_colwidth', None)   
followers_bios_fullfilter[60:80]

Unnamed: 0,follower_id,id,screen_name,description,timestamp_utc,location,tweets,followers,friends,likes,lists,description_cleantext
2993,1412345621185400846,1412345621185400846,kiff2005,||🍣||,1625564000.0,,181.0,80.0,479.0,1247.0,0.0,||||
3062,2966283549,2966283549,arthur_roussel1,"Spandau, Berlin, Europa. Richtung Osten und Gerechtigkeit. \n#EUerSpandau 🐻🇪🇺🎲\nDes vues franco-allemandes souvent contradictoires, mais toujours européennes.",1420651000.0,"Berlin, Allemagne",920.0,160.0,1180.0,2204.0,0.0,"Spandau, Berlin, Europa. Richtung Osten und Gerechtigkeit. \n#EUerSpandau \nDes vues franco-allemandes souvent contradictoires, mais toujours europeennes."
3234,2417446394,2417446394,mantasroudoniki,,1396100000.0,hollywood,413.0,304.0,1061.0,16.0,1.0,
3257,347912349,347912349,weemeryum,,1312385000.0,,148.0,64.0,1202.0,872.0,0.0,
3346,933557377,933557377,LoueThomas,my demon will destroy you.🤡,1352345000.0,,361.0,247.0,1293.0,346.0,2.0,my demon will destroy you.
3400,1328246257,1328246257,fancynkr,"CERTIFIED FREELANCE MAKEUP ARTIST , HAIRSTYLIST, & NAIL TECH FOR ANY OCCASION 💕💄💅🏼💸 FOLLOW ME ON IG !!! @fancynkr",1365129000.0,"Milltown, NJ",848.0,109.0,437.0,1094.0,12.0,"CERTIFIED FREELANCE MAKEUP ARTIST , HAIRSTYLIST, & NAIL TECH FOR ANY OCCASION FOLLOW ME ON IG !!! @fancynkr"
3412,3305256363,3305256363,Esse_NonVideri,,1433130000.0,"Bordeaux, France",657.0,33.0,495.0,1637.0,1.0,
3440,2375059735,2375059735,6e2da7beedc94cb,,1394096000.0,,117.0,255.0,4658.0,150.0,18.0,
3470,1561159488,1561159488,Yalin_monsalve,"•La vida me consiente♥♥ »•No soy la mejor pero, soy Única.!",1372703000.0,Caracas,1802.0,247.0,1786.0,54.0,3.0,"*La vida me consiente >>*No soy la mejor pero, soy Unica.!"
3481,342936505,342936505,FatimaRaam,Quedé 🤡,1311712000.0,,336.0,377.0,1153.0,2381.0,2.0,Quede


In [92]:
print(followers_bios_fullfilter.dtypes)

follower_id               object
id                        object
screen_name               object
description               object
timestamp_utc            float64
location                  object
tweets                   float64
followers                float64
friends                  float64
likes                    float64
lists                    float64
description_cleantext     object
dtype: object


In [85]:
# Now write the two dfs to CSV files to save them in case something happens
# markers_followers_fullfilter.to_csv('/home/livtollanes/NewData/markers_followers_cleaned_nolang.csv', encoding='utf-8', index=False)

#followers_bios_fullfilter.to_csv('/home/livtollanes/NewData/followers_bios_cleaned_nolang3.csv', sep=',', encoding='utf-8', index=False, quoting=csv.QUOTE_NONNUMERIC)
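
Bios contain commas, quotes and line breaks, so it is worth checking that a write followed by a read returns the same values. The sketch below runs such a round trip on toy data with `csv.QUOTE_NONNUMERIC`, using an in-memory buffer instead of the real output path; the toy rows are made up for illustration.

```python
import csv
import io

import pandas as pd

toy = pd.DataFrame({
    "follower_id": ["1412345621185400846", "2966283549"],
    "description": ['Line one,\nline two with "quotes"', None],
    "tweets": [181.0, 920.0],
})

# Write with all non-numeric fields quoted, as in the commented-out call above.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read back, keeping IDs and text as strings.
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"follower_id": "object", "description": "object"})

print(reloaded["follower_id"].tolist() == toy["follower_id"].tolist())
```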



## 4. Filter based on language: keep only French accounts
- Use language recognition algorithms to filter the follower bios.
- We only want French-language bios to be included
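
The notebook does not say which language recognition method is used. Purely as an illustration of the filtering step, the sketch below applies a toy stopword-count heuristic to the cleaned bios. The word list, the `min_hits` threshold and the function `looks_french` are all assumptions, and a dedicated language-identification library would be far more reliable on short bios.

```python
import re

import pandas as pd

# Small, hand-picked list of common French function words (an assumption).
FRENCH_HINTS = {
    "le", "la", "les", "des", "du", "une", "et", "est", "ne", "pas",
    "mais", "ce", "je", "pour", "dans", "sur", "avec",
}


def looks_french(text, min_hits=2):
    """Toy heuristic: at least `min_hits` tokens are common French function words."""
    if not isinstance(text, str):
        return False  # missing bios cannot be classified
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in FRENCH_HINTS for token in tokens) >= min_hits


toy = pd.DataFrame({
    "follower_id": ["487765672", "933557377", "263192745", "3139392851"],
    "description_cleantext": [
        "les gangsters ne dansent pas,\nmais ce soir c'est les cances-va",
        "my demon will destroy you.",
        "Tribune Tony Marek #RCL",
        None,
    ],
})
french_bios = toy[toy["description_cleantext"].map(looks_french)]
print(french_bios["follower_id"].tolist())
```

Bios that are empty or contain only names and hashtags are not classified as French by this heuristic, so how to treat accounts without a usable description has to be decided separately.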


In [None]:
# If starting from a fresh kernel, load the dataframes from the CSV files. Remember to look in the wordata dir

In [39]:
#Load marker followers
req_cols = ['marker_id', 'follower_id']
dtypes = {'marker_id': 'object',
          'follower_id': 'object'}

markers_followers_clean = pd.read_csv('/home/livtollanes/NewData/markers_followers_cleaned_nolang.csv', encoding='utf-8', dtype=dtypes, usecols=req_cols)

In [89]:
#Loading the followers bios (with cleaned description column)
full_path = '/home/livtollanes/NewData/followers_bios_cleaned_nolang3.csv'

req_cols = ['follower_id', 'screen_name', 'description', 'description_cleantext', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists','timestamp_utc']

dtypes = {
    'follower_id': 'object',
    'screen_name': 'object',
    'description': 'object',
    'description_cleantext': 'object',
    'location': 'object',
    'tweets': 'float64',
    'followers': 'float64',
    'friends': 'float64'
}

follower_bios_cleaned3 = pd.read_csv(full_path, usecols=req_cols, dtype=dtypes, engine= 'python')

In [90]:
compare_column_values(follower_bios_cleaned3, markers_followers_clean, 'follower_id')

#The follower_ids are still streamlined, indicating that writing and reading of the cleaned dfs was successful

There are 0 unique values in df1 that don't exist in df2.
There are 0 unique values in df2 that don't exist in df1.


In [91]:
follower_bios_cleaned3.head(10)

Unnamed: 0,follower_id,screen_name,description,timestamp_utc,location,tweets,followers,friends,likes,lists,description_cleantext
0,30797693,AVMGDIGITALHD,THE NEW DIGITAL STATION!!! \nfollowed by @ROCNATION + @MASTERCARD,1239593000.0,,73584.0,777.0,1504.0,250.0,12.0,THE NEW DIGITAL STATION!!! \nfollowed by @ROCNATION + @MASTERCARD
1,134483898,Ferdlarez,Electricist Professional,1271603000.0,Venezuela,7868.0,204.0,2182.0,1552.0,2.0,Electricist Professional
2,2779899894,nestorale3,En una biografia no me conoceras relamente :3 #Frente #Sur #LGG Real Madrid CR7\n\nLo mio siempre sera una Blanquita☺♥,1409374000.0,Valencia _ Los guayos city,990.0,180.0,1101.0,488.0,0.0,En una biografia no me conoceras relamente :3 #Frente #Sur #LGG Real Madrid CR7\n\nLo mio siempre sera una Blanquita
3,487765672,bbbbbbrieuc,"les gangsters ne dansent pas,\nmais ce soir c'est les cances-va",1328812000.0,,4979.0,88.0,306.0,18415.0,2.0,"les gangsters ne dansent pas,\nmais ce soir c'est les cances-va"
4,3139392851,lermitevvv,,1428301000.0,,17245.0,1033.0,1005.0,14468.0,9.0,
5,889782143513186304,altarocsamu2B,,1500976000.0,,799.0,31.0,430.0,641.0,0.0,
6,464194418,JohanaCrosby,Mi Madre lo es Todo😍\nQue chimba los recuerdos 🍷,1326584000.0,Espagne,771.0,449.0,845.0,2138.0,1.0,Mi Madre lo es Todo\nQue chimba los recuerdos
7,1039234359822364673,MRahmatuallah,,1536608000.0,,790.0,126.0,4842.0,711.0,0.0,
8,263192745,e_freydrich,Tribune Tony Marek #RCL,1299685000.0,France,1683.0,179.0,423.0,601.0,1.0,Tribune Tony Marek #RCL
9,982596266080260096,vuuuzy,"If interstellar tourism was more developed, Earth would be known for its flatbread dishes. So may amazing variants of the same idea all over the globe! #yummy",1523104000.0,France,148.0,74.0,802.0,75.0,0.0,"If interstellar tourism was more developed, Earth would be known for its flatbread dishes. So may amazing variants of the same idea all over the globe! #yummy"
