## Data Cleaning Plan

We have three data sets:
- Markers' bios and metadata (markers_bios)
- Followers' bios and metadata (followers_bios)
- All brands and their followers (markers-followers)


Step-by-step plan:
1. Load the follower bios and the marker-follower file.
    - Provide summary statistics for users and brands: how many brands and followers are there, and are there missing values or duplicates?

2. Filter the marker-follower df:
    - Create a dictionary of brand counts per follower
    - Remove users that follow fewer than 5 brands (the threshold can be raised later)
    - Track the number of users removed at each step
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only include users that follow at least five brands. Subsequent filters are then applied to the up-to-date follower bios.

3. Filter the follower bios:
    - Remove users with fewer than 25 followers
    - Remove users with fewer than 100 tweets

4. Filter based on language: keep only French accounts









In [2]:
# Standard library imports
import os
import re
import csv
import sys
import html
import pickle
import importlib
from datetime import datetime
from collections import defaultdict

# Third-party library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import psutil
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unidecode import unidecode

# Local application/library specific imports
import utils2
from utils2 import *

## 1. Load files and summary stats

In [4]:
# Load the data files whose columns have already been renamed (to distinguish follower_id from marker_id)
importlib.reload(utils2)

# Load markers-followers
load_path = '/home/livtollanes/NewData'
file = 'markers_followers_2023-05-19.csv'

req_cols = ['marker_id', 'follower_id']
# Note: float64 represents integers exactly only up to 2**53 (about 9.0e15),
# so IDs around 1e18 may be rounded when stored as float64.
dtypes = {'marker_id': 'float64',
          'follower_id': 'float64'}

markers_followers = utils2.fileloader(load_path, file, req_cols, dtypes)

In [29]:
# Load the followers bios
load_path = '/home/livtollanes/SocialMarkers'
file = 'markers_followers_bios_2023-05-19.csv'

req_cols = ['twitter_id', 'id', 'screen_name', 'description', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists','timestamp_utc']

dtypes = {
    'twitter_id': 'float64',
    'id': 'float64',
    'screen_name': 'object',
    'description': 'object',
    'location': 'object',
    'tweets': 'float64',
    'followers': 'float64',
    'friends': 'float64',
    'witheld_in_countries': 'float64'
}

followers_bios = utils2.fileloader(load_path, file, req_cols, dtypes)



# Rename the twitter_id column to follower_id
followers_bios.rename(columns={'twitter_id':'follower_id'}, inplace=True)

Summary statistics

In [6]:
importlib.reload(utils2)
utils2.summary_stats(followers_bios, print_dtypes=False)

Shape of DataFrame:  (70666646, 11)

Columns in DataFrame:  ['follower_id', 'id', 'screen_name', 'description', 'timestamp_utc', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists']

Number of unique values in 'follower_id':  70666351

Number of duplicate values in 'follower_id':  295

Number of unique values in 'id':  70642371

Number of duplicate values in 'id':  24274

Number of missing values in each column:
 follower_id             0
id                  23985
screen_name         23986
description      42027215
timestamp_utc       23985
location         47956041
tweets              23985
followers           23985
friends             23985
likes               23985
lists               23985
dtype: int64


In [7]:
importlib.reload(utils2)
utils2.summary_stats(markers_followers, print_dtypes=False)

Shape of DataFrame:  (126345412, 2)

Columns in DataFrame:  ['marker_id', 'follower_id']

Number of unique values in 'follower_id':  70636002

Number of duplicate values in 'follower_id':  55709410

Number of unique values in 'marker_id':  236

Number of duplicate values in 'marker_id':  126345176

Number of missing values in each column:
 marker_id      0
follower_id    0
dtype: int64


In [30]:
# Convert the 'follower_id' column to float in both dataframes
followers_bios['follower_id'] = followers_bios['follower_id'].astype(float)
markers_followers['follower_id'] = markers_followers['follower_id'].astype(float)
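
One caveat with casting IDs to float is sketched below in plain Python. float64 has a 53-bit significand, so not every integer above 2**53 (about 9.0e15) is representable, and IDs of the magnitude printed in this notebook (around 1e18) can collide after conversion. The two IDs below are made up for illustration. Keeping IDs as integers or strings would avoid the problem, but that is a suggestion, not what `utils2.fileloader` is known to do.

```python
# Hypothetical 64-bit IDs that differ only in the last digit
id_a = 1655337000000000001
id_b = 1655337000000000002

# As Python ints (or strings) the two IDs stay distinct
ints_distinct = id_a != id_b

# As float64 both round to the same nearest representable value
floats_collide = float(id_a) == float(id_b)

# Every integer up to this limit is exactly representable in float64
exact_limit = 2 ** 53

print(ints_distinct, floats_collide, exact_limit)
```

If collisions of this kind occur in the real data, they would show up as spurious duplicates or mismatches when the two dataframes are compared on `follower_id`.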

In [11]:
compare_column_values(followers_bios, markers_followers, 'follower_id')

There are 2463629 unique values in df1 that don't exist in df2.
There are 2433280 unique values in df2 that don't exist in df1.
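
The source of `compare_column_values` is not shown in this notebook. A minimal sketch of such a comparison, assuming it boils down to set differences over the unique values of one column, could look like this. The function name `unique_value_differences` and the toy dataframes are hypothetical.

```python
import pandas as pd

def unique_value_differences(df1, df2, column):
    """Return the unique values of `column` found only in df1 and only in df2."""
    values1 = set(df1[column].dropna().unique())
    values2 = set(df2[column].dropna().unique())
    return values1 - values2, values2 - values1

# Toy data standing in for followers_bios and markers_followers
bios = pd.DataFrame({"follower_id": [1, 2, 3, 3]})
edges = pd.DataFrame({"marker_id": [10, 10, 11], "follower_id": [2, 3, 4]})

only_in_bios, only_in_edges = unique_value_differences(bios, edges, "follower_id")
print(f"There are {len(only_in_bios)} unique values in df1 that don't exist in df2.")
print(f"There are {len(only_in_edges)} unique values in df2 that don't exist in df1.")
```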


In [16]:
importlib.reload(utils2)
missing_in_followers_bios, missing_in_markers_followers = utils2.get_discrepancies(followers_bios, markers_followers, 'follower_id')

print("Missing in followers_bios:")
print(missing_in_followers_bios.head(10))

print("Missing in markers_followers:")
print(missing_in_markers_followers.head(10))

Missing in followers_bios:
8      9.568676e+17
10     9.561652e+17
57     9.145880e+17
106    9.670257e+17
209    9.130761e+17
222    1.273530e+18
267    9.344244e+17
281    9.272911e+17
310    1.008815e+18
338    1.039234e+18
Name: follower_id, dtype: float64
Missing in markers_followers:
0      1.655337e+18
141    1.659624e+18
177    1.659617e+18
242    1.659602e+18
249    1.659601e+18
265    1.659597e+18
313    1.597933e+18
316    1.221570e+18
324    1.659590e+18
332    1.583930e+18
Name: follower_id, dtype: float64


In [22]:
duplicates = followers_bios[followers_bios.duplicated('follower_id', keep=False)]
duplicates_sorted = duplicates.sort_values('follower_id')
print(len(duplicates_sorted))

590


In [31]:
# Calculate the original number of rows
original_num_rows = len(followers_bios)

# Remove duplicated follower_ids in followers_bios
followers_bios_nd = followers_bios.drop_duplicates(subset='follower_id', keep='first')

# Calculate the new number of rows
new_num_rows = len(followers_bios_nd)

# Calculate and print the number of rows removed
num_rows_removed = original_num_rows - new_num_rows
print(f"Number of rows removed: {num_rows_removed}")

# Print the number of rows left
print(f"Number of rows left: {new_num_rows}")

Number of rows removed: 295
Number of rows left: 70666351
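
The 590 rows printed earlier and the 295 rows removed here are consistent: `duplicated(keep=False)` flags every row of a duplicated ID, while `drop_duplicates(keep='first')` removes only the repeats. Together the two numbers imply that each duplicated ID occurs exactly twice. The toy example below, with made-up data, shows the difference between the two calls.

```python
import pandas as pd

# Toy bios table: ID 2 appears twice, all other IDs once
bios = pd.DataFrame({
    "follower_id": [1, 2, 2, 3],
    "screen_name": ["a", "b_old", "b_new", "c"],
})

# keep=False marks every row whose ID occurs more than once
all_dupe_rows = bios[bios.duplicated("follower_id", keep=False)]

# keep='first' drops only the repeats and keeps the first row per ID
deduped = bios.drop_duplicates(subset="follower_id", keep="first")

rows_removed = len(bios) - len(deduped)
print(len(all_dupe_rows), rows_removed)
```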


In [33]:
# Get the duplicate 'follower_id's from followers_bios
duplicate_ids = followers_bios[followers_bios.duplicated('follower_id', keep=False)]['follower_id'].unique()

# Get the number of rows before removal
rows_before = len(markers_followers)

# Remove rows in markers_followers that have the same 'follower_id's
markers_followers_nd = markers_followers[~markers_followers['follower_id'].isin(duplicate_ids)]

# Get the number of rows after removal
rows_after = len(markers_followers_nd)

# Calculate and print the number of rows removed
rows_removed = rows_before - rows_after
print(f"Removed {rows_removed} rows from markers_followers.")

# Print the number of rows left
print(f"Number of rows left in markers_followers after removal: {rows_after}")

Removed 881 rows from markers_followers.
Number of rows left in markers_followers after removal: 126344531


## 2. Filter the marker-follower df

- Filter the marker-follower df:
    - Create a dictionary of brand counts per follower

    - Remove users that follow fewer than 5 brands (the threshold can be raised later)

    - Track the number of users removed at each step
    
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only include users that follow at least five brands. Subsequent filters are then applied to the up-to-date follower bios.

Remove users that follow fewer than 5 brands

In [34]:
importlib.reload(utils2)
n = 5  # minimal number of brands followed required to be included in the analysis
markers_followers_5 = utils2.filter_followers(markers_followers_nd, 'follower_id', n)

66606010 followers follow less than 5 brands (94.30% of the total followers).
After removing these followers, 4029718 followers are left (5.70% of the followers in the inputted df).
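
`utils2.filter_followers` is not shown in this notebook. The sketch below illustrates the idea from the plan (a dictionary of brand counts per follower, then a threshold) in plain Python on a toy edge list. The function name `filter_by_brand_count`, the pair representation and the IDs are assumptions made for illustration.

```python
from collections import defaultdict

def filter_by_brand_count(pairs, min_brands):
    """Keep (marker_id, follower_id) pairs whose follower follows
    at least `min_brands` distinct brands."""
    brands_per_follower = defaultdict(set)
    for marker_id, follower_id in pairs:
        brands_per_follower[follower_id].add(marker_id)

    # Dictionary of brand counts per follower
    counts = {f: len(b) for f, b in brands_per_follower.items()}
    kept = {f for f, c in counts.items() if c >= min_brands}

    # Track how many followers the filter removes
    removed = len(counts) - len(kept)
    share = 100 * removed / len(counts) if counts else 0.0
    print(f"{removed} followers follow fewer than {min_brands} brands "
          f"({share:.2f}% of all followers).")

    filtered = [(m, f) for m, f in pairs if f in kept]
    return filtered, counts

# Toy edge list: follower "u1" follows 3 brands, "u2" follows 1
pairs = [("b1", "u1"), ("b2", "u1"), ("b3", "u1"), ("b1", "u2")]
filtered, counts = filter_by_brand_count(pairs, min_brands=2)
```

Counting distinct brands through a set, rather than counting rows, keeps the result unchanged if the same marker-follower pair happens to appear more than once.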


Match the IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only cover users who follow at least 5 brands

In [36]:
importlib.reload(utils2)

followers_bios_nd5 = utils2.streamline_IDs(markers_followers_5, followers_bios_nd, 'follower_id')

Number of unique follower_id in source: 4029718
Number of unique follower_id in df_tofilter after filtering: 3904911
Removed 66761440 rows.
3904911 rows are left.
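
`utils2.streamline_IDs` is not shown either. Judging from its arguments and printed output, it keeps the rows of one dataframe whose ID also occurs in another. A minimal sketch under that assumption, with a hypothetical function name and toy data:

```python
import pandas as pd

def keep_matching_ids(source, df_tofilter, column):
    """Keep only the rows of `df_tofilter` whose `column` value also occurs in `source`."""
    valid_ids = source[column].unique()
    filtered = df_tofilter[df_tofilter[column].isin(valid_ids)]
    print(f"Removed {len(df_tofilter) - len(filtered)} rows.")
    print(f"{len(filtered)} rows are left.")
    return filtered

# Toy stand-ins for the filtered marker-follower df and the bios df
edges = pd.DataFrame({"marker_id": [10, 11, 12], "follower_id": [1, 1, 2]})
bios = pd.DataFrame({"follower_id": [1, 2, 3, 4], "tweets": [500, 20, 7, 0]})

bios_matched = keep_matching_ids(source=edges, df_tofilter=bios, column="follower_id")
```

A filter like this is one-directional: IDs that exist in the source but have no row in the filtered dataframe are not reported, so a separate comparison is still needed to find them.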


In [37]:
compare_column_values(followers_bios_nd5, markers_followers_5, 'follower_id')   

There are 0 unique values in df1 that don't exist in df2.
There are 124807 unique values in df2 that don't exist in df1.


In [39]:
# Number of unique follower_id values in markers_followers_5
print(len(markers_followers_5['follower_id'].unique()))

4029718


In [40]:
# What percentage of 4029718 is 124807?
print((124807/4029718)*100)


3.097164615489223


## 3. Filter the follower bios
- Remove users with fewer than 25 followers
- Remove users with fewer than 100 tweets
- Update the markers-followers df to match the filtered bio df
- Filter based on language: keep only French accounts


In [38]:
followers_bios_nd5.columns

Index(['follower_id', 'id', 'screen_name', 'description', 'timestamp_utc',
       'location', 'tweets', 'followers', 'friends', 'likes', 'lists'],
      dtype='object')

In [42]:
importlib.reload(utils2)
followers_bios_nd5_tweets_followers = utils2.filter_by_tweets_and_followers(followers_bios_nd5, min_followers= 25, min_tweets= 100)


Removed 2684161 rows.
1220750 rows are left.
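
The body of `utils2.filter_by_tweets_and_followers` is not part of this notebook. The sketch below shows one way to apply the two thresholds with a boolean mask, assuming both are inclusive (users with exactly 25 followers or 100 tweets are kept). The function name and the toy data are hypothetical.

```python
import pandas as pd

def filter_min_activity(df, min_followers, min_tweets):
    """Keep rows with at least `min_followers` followers and `min_tweets` tweets."""
    # Comparisons against NaN evaluate to False, so rows with missing counts are dropped
    mask = (df["followers"] >= min_followers) & (df["tweets"] >= min_tweets)
    filtered = df[mask]
    print(f"Removed {len(df) - len(filtered)} rows.")
    print(f"{len(filtered)} rows are left.")
    return filtered

bios = pd.DataFrame({
    "follower_id": [1, 2, 3, 4],
    "followers": [30.0, 10.0, 500.0, None],
    "tweets": [150.0, 900.0, 99.0, 200.0],
})

active = filter_min_activity(bios, min_followers=25, min_tweets=100)
```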


Again, remove the follower IDs in markers-followers that do not occur in the newly filtered followers_bios_nd5_tweets_followers

In [43]:
markers_followers_5_tweets_followers = utils2.streamline_IDs(source= followers_bios_nd5_tweets_followers, df_tofilter=markers_followers_5, column='follower_id')

Number of unique follower_id in source: 1220750
Number of unique follower_id in df_tofilter after filtering: 1220750
Removed 19422533 rows.
9798809 rows are left.


In [51]:
compare_column_values(followers_bios_nd5_tweets_followers, markers_followers_5_tweets_followers, 'follower_id')

There are 0 unique values in df1 that don't exist in df2.
There are 0 unique values in df2 that don't exist in df1.


In [52]:
# Write the two dfs to CSV as a checkpoint in case something goes wrong
# followers_bios_nd5_tweets_followers.to_csv('/home/livtollanes/NewData/workdata/followers_bios_cleaned_nolang.csv', index=False)
# markers_followers_5_tweets_followers.to_csv('/home/livtollanes/NewData/workdata/markers_followers_cleaned_nolang.csv', index=False)


# These files were messed up because of emojis. The entire code above should be rerun, and the CSVs then written and read with the encoding specified explicitly:
#df.to_csv('filename.csv', encoding='utf-8', index=False)
#df = pd.read_csv('filename.csv', encoding='utf-8')
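
A small round-trip check can confirm whether a checkpoint survives writing and reading. The sketch below uses made-up bios containing an emoji and an embedded newline, writes them with an explicit encoding, and reads them back with the IDs as strings. The file name and the data are hypothetical, and whether this resolves the corruption seen in the real files is an assumption that would need to be verified on the actual data.

```python
import os
import tempfile

import pandas as pd

bios = pd.DataFrame({
    "follower_id": ["1655337000000000001", "30797693"],  # made-up IDs kept as strings
    "description": ["Bio with emoji 😍", "Line one\nline two"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bios_checkpoint.csv")  # hypothetical file name

    # Explicit encoding on write; fields containing newlines are quoted by default
    bios.to_csv(path, index=False, encoding="utf-8")

    # Same encoding on read; dtype=str keeps the IDs from being parsed as numbers
    restored = pd.read_csv(path, encoding="utf-8", dtype={"follower_id": str})

ids_match = restored["follower_id"].tolist() == bios["follower_id"].tolist()
text_match = restored["description"].tolist() == bios["description"].tolist()
print(ids_match, text_match)
```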

## 4. Filter based on language: keep only french accounts
- Use language recognition algorithms to filter the follower bios.
- Only French-language bios should be included.
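
The notebook does not yet say which language recognition algorithm will be used. The sketch below therefore keeps the detector pluggable: `keep_language` accepts any callable that maps a text to a language code. The `toy_detect` function is only a stand-in based on a few function words so that the example runs on its own. It is not a real language model and should be replaced by a proper detector. All names and sample bios are hypothetical, and skipping empty bios is an assumption.

```python
def keep_language(bios, detect, target="fr"):
    """Keep bios whose description is classified as `target` by the `detect` callable."""
    kept = []
    for bio in bios:
        text = bio.get("description")
        if not text:  # missing or empty bios cannot be classified
            continue
        if detect(text) == target:
            kept.append(bio)
    return kept

# Toy stand-in detector based on a handful of function words; NOT a real language model
FRENCH_HINTS = {"le", "la", "les", "et", "je", "de", "des", "un", "une"}
ENGLISH_HINTS = {"the", "and", "is", "of", "in", "a", "my"}

def toy_detect(text):
    words = set(text.lower().split())
    fr_hits, en_hits = len(words & FRENCH_HINTS), len(words & ENGLISH_HINTS)
    if fr_hits == en_hits:
        return "unknown"
    return "fr" if fr_hits > en_hits else "en"

bios = [
    {"follower_id": "1", "description": "Je suis fan de musique et de cinéma"},
    {"follower_id": "2", "description": "Lover of the sea and the mountains"},
    {"follower_id": "3", "description": None},
]
french_bios = keep_language(bios, toy_detect)
```

Passing the detector in as an argument makes it possible to compare several language recognition methods on the same bios without changing the filtering code.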


In [None]:
# If starting from a fresh kernel, load the dataframes from the CSVs. Remember to look in the workdata dir

In [3]:
# Load markers-followers
load_path = '/home/livtollanes/NewData/workdata'
file = 'markers_followers_cleaned_nolang.csv'

req_cols = ['marker_id', 'follower_id']
dtypes = {'marker_id': 'float64',
          'follower_id': 'float64'}

markers_followers_clean = utils2.fileloader(load_path, file, req_cols, dtypes)

In [15]:
# Load the followers bios
full_path = '/home/livtollanes/NewData/workdata/followers_bios_cleaned_nolang.csv'

req_cols = ['follower_id', 'screen_name', 'description', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists','timestamp_utc']

dtypes = {
    'follower_id': 'object',
    'screen_name': 'object',
    'description': 'object',
    'location': 'object',
    'tweets': 'float64',
    'followers': 'float64',
    'friends': 'float64',
    'witheld_in_countries': 'float64'
}

follower_bios_cleaned = pd.read_csv(full_path, usecols=req_cols, dtype=dtypes, engine='python')

In [24]:
# Convert the 'follower_id' column to float, coercing errors to NaN
follower_bios_cleaned['follower_id_float'] = pd.to_numeric(follower_bios_cleaned['follower_id'], errors='coerce')

# Find rows where 'follower_id_float' is NaN
different_rows = follower_bios_cleaned[follower_bios_cleaned['follower_id_float'].isna()]

# Print the 'follower_id' values for these rows
print(different_rows['follower_id'])

16143                         Soy musulmán y estoy orgulloso
16144       Facebook : http://www.facebook.com/darmawan.t...
29181                                           15  Añitos 😍
29182                                             Soltero ✌ 
29183                                        Wsp:2317514194💖
29184                                  Facebook:Alee Ortiz 🙊
29185                                            Deel Nueve😎
29223                                                    NaN
30828                                              Some hope
30829                          in this world is necessary...
30978                                                    NaN
37304                                           Man United ⚽
39537            FUN ; FAIL ; COOL ; OMG ; WTF And More ! ;)
44742                            | D.R Congo Kinshasa City |
52251                                          snap: cloe_on
67350                 SabEr muchO D' Mii hacE DañoO Ơ̴̴͡.̮Ơ͡
75234                Exp

In [25]:
print(different_rows)

                                               follower_id  \
16143                       Soy musulmán y estoy orgulloso   
16144     Facebook : http://www.facebook.com/darmawan.t...   
29181                                         15  Añitos 😍   
29182                                           Soltero ✌    
29183                                      Wsp:2317514194💖   
29184                                Facebook:Alee Ortiz 🙊   
29185                                          Deel Nueve😎   
29223                                                  NaN   
30828                                            Some hope   
30829                        in this world is necessary...   
30978                                                  NaN   
37304                                         Man United ⚽   
39537          FUN ; FAIL ; COOL ; OMG ; WTF And More ! ;)   
44742                          | D.R Congo Kinshasa City |   
52251                                        snap: cloe_on   
67350   

In [19]:
follower_bios_cleaned.head(3)

| | follower_id | screen_name | description | timestamp_utc | location | tweets | followers | friends | likes | lists |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30797693.0 | AVMGDIGITALHD | THE NEW DIGITAL STATION!!! \nfollowed by @ROCN... | 1239593000.0 | | 73584.0 | 777.0 | 1504.0 | 250.0 | 12.0 |
| 1 | 134483898.0 | Ferdlarez | Electricist Professional | 1271603000.0 | Venezuela | 7868.0 | 204.0 | 2182.0 | 1552.0 | 2.0 |
| 2 | 2779899894.0 | nestorale3 | En una biografia no me conoceras relamente :3 ... | 1409374000.0 | Valencia _ Los guayos city | 990.0 | 180.0 | 1101.0 | 488.0 | 0.0 |
