## Data Cleaning Plan

We have three data sets:
- Markers' bios and metadata (markers_bios)
- Followers' bios and metadata (followers_bios)
- All brands and their followers (markers_followers)


Step-by-step plan:
1. Load the follower bios and the marker-follower file.
    - Provide summary statistics of users and brands: how many brands and how many followers do we have? Are there missing values, duplicates, etc.?

2. Filter the marker-follower df:
    - Create a dictionary of brand counts per follower
    - Remove users that follow fewer than n brands (n = 5, possibly higher)
    - Continuously track the number of users removed
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df. The follower bios then only include users that follow at least n brands, so subsequent filters run on the correct users (up-to-date follower bios).

3. Apply the filters to the follower bios:
    - Remove users with fewer than 25 followers
    - Remove users with fewer than 100 tweets
    - Filter based on language: keep only French accounts

4. Match the IDs again. The source is the follower bios and the target is the marker-follower df, so the marker-follower list only contains the relevant users.
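The plan asks for the number of removed users to be tracked at every step. One way to do that is a small log object that stores the row count after each filter. This is only a sketch: `FilterLog` and its methods are hypothetical names and are not part of `utils2`. The two row counts in the usage lines are the ones reported for `followers_bios` further down.

```python
class FilterLog:
    """Hypothetical helper: remembers the row count after each cleaning step."""

    def __init__(self, name, n_start):
        self.name = name
        self.steps = [("start", n_start)]

    def record(self, step, n_rows):
        """Store the row count after a step and return the rows that step removed."""
        removed = self.steps[-1][1] - n_rows
        self.steps.append((step, n_rows))
        return removed

    def report(self):
        """Return one summary line per recorded step."""
        n_start = self.steps[0][1]
        lines = []
        for (_, before), (step, after) in zip(self.steps, self.steps[1:]):
            lines.append(
                f"{self.name} | {step}: removed {before - after}, "
                f"{after} left ({after / n_start:.2%} of start)"
            )
        return lines


# Usage with the row counts reported for followers_bios
log = FilterLog("followers_bios", 70_666_646)
removed = log.record("drop duplicate follower_id", 70_666_351)
for line in log.report():
    print(line)
```

Because the log keeps the starting count, percentages stay relative to the original data even when a cell is executed more than once.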








In [2]:
# Standard library imports
import os
import re
import csv
import sys
import html
import pickle
import importlib
from datetime import datetime
from collections import defaultdict

# Third-party library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import psutil
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unidecode import unidecode

# Local application/library specific imports
import utils2
from utils2 import *

## 1. Load files and summary stats

In [3]:
# Load the data files that have already been renamed (to distinguish follower_id from marker_id)
importlib.reload(utils2)

# Load markers-followers
load_path = '/home/livtollanes/NewData'
file = 'markers_followers_2023-05-19.csv'

req_cols = ['marker_id', 'follower_id']
dtypes = {'marker_id': 'float64',
          'follower_id': 'float64'}

markers_followers = utils2.fileloader(load_path, file, req_cols, dtypes)
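A note on the `float64` dtype chosen for the ID columns: `float64` can only represent every integer exactly up to 2**53, and the largest IDs in this data are around 1.6e18. The sketch below uses two made-up IDs of that magnitude to show that distinct IDs can collapse to the same float. Whether this actually affects the results here has not been checked, but the duplicated `follower_id` rows printed later, which pair different screen names under one ID, would be consistent with it. Loading IDs with an integer or string dtype would avoid the problem, assuming `utils2.fileloader` passes the dtypes through to pandas.

```python
# float64 has a 53-bit significand: above 2**53 not every integer is representable.
LIMIT = 2 ** 53  # 9_007_199_254_740_992

# Two distinct, made-up 19-digit IDs (illustrative values, not taken from the data)
id_a = 1_655_337_000_000_000_001
id_b = 1_655_337_000_000_000_002

collide_as_float = float(id_a) == float(id_b)  # both round to the same float64
round_trip_ok = int(float(id_a)) == id_a       # the original ID is not recovered
distinct_as_int = id_a != id_b                 # integers keep them apart
distinct_as_str = str(id_a) != str(id_b)       # so do strings

print(collide_as_float, round_trip_ok, distinct_as_int, distinct_as_str)
```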

In [5]:
# Load the followers bios
load_path = '/home/livtollanes/SocialMarkers'
file = 'markers_followers_bios_2023-05-19.csv'

req_cols = ['twitter_id', 'id', 'screen_name', 'description', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists', 'timestamp_utc']

dtypes = {
    'twitter_id': 'float64',
    'id': 'float64',
    'screen_name': 'object',
    'description': 'object',
    'location': 'object',
    'tweets': 'float64',
    'followers': 'float64',
    'friends': 'float64',
    'witheld_in_countries': 'float64'
}

followers_bios = utils2.fileloader(load_path, file, req_cols, dtypes)



# Rename the twitter_id column to follower_id
followers_bios.rename(columns={'twitter_id':'follower_id'}, inplace=True)

Summary statistics

In [6]:
markers_followers.head(3)

Unnamed: 0,marker_id,follower_id
0,415859364.0,1.655337e+18
1,415859364.0,1.659648e+18
2,415859364.0,1.525534e+18


In [7]:
followers_bios.head(3)

Unnamed: 0,follower_id,id,screen_name,description,timestamp_utc,location,tweets,followers,friends,likes,lists
0,3342215000.0,3342215000.0,titisanogo8,Je crois en DIEU et à mon travail j'y arrivera...,1435018000.0,"Ile-de-France, France",6.0,44.0,733.0,91.0,0.0
1,3115496000.0,3115496000.0,AndreDeybach,,1427309000.0,,0.0,1.0,40.0,0.0,0.0
2,244075000.0,244075000.0,matttownley1985,"Hotelier, traveller, fan of all things hospita...",1296221000.0,"Manchester, England",2535.0,772.0,1264.0,1251.0,7.0


In [27]:
importlib.reload(utils2)
utils2.summary_stats(followers_bios, print_dtypes=False)

Shape of DataFrame:  (70666646, 11)

Columns in DataFrame:  ['follower_id', 'id', 'screen_name', 'description', 'timestamp_utc', 'location', 'tweets', 'followers', 'friends', 'likes', 'lists']

Number of unique values in 'follower_id':  70666351

Number of duplicate values in 'follower_id':  295

Number of unique values in 'id':  70642371

Number of duplicate values in 'id':  24274

Number of missing values in each column:
 follower_id             0
id                  23985
screen_name         23986
description      42027215
timestamp_utc       23985
location         47956041
tweets              23985
followers           23985
friends             23985
likes               23985
lists               23985
dtype: int64


In [27]:
importlib.reload(utils2)
utils2.summary_stats(markers_followers, print_dtypes=False)

Shape of DataFrame:  (126345412, 2)

Columns in DataFrame:  ['marker_id', 'follower_id']

Number of unique values in 'follower_id':  70636002

Number of duplicate values in 'follower_id':  55709410

Number of unique values in 'marker_id':  236

Number of duplicate values in 'marker_id':  126345176

Number of missing values in each column:
 marker_id      0
follower_id    0
dtype: int64
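The source of `utils2.summary_stats` is not shown, so the following is only a guess at how such figures can be computed with pandas. `summarize` is a hypothetical stand-in that returns the numbers instead of printing them. The reported counts are at least consistent with `nunique()` and `duplicated().sum()`: for `id` in `followers_bios`, 70642371 unique plus 24274 duplicates is one short of the row count, which is what happens when missing values are left out of the unique count but still counted once as a first occurrence.

```python
import pandas as pd


def summarize(df, id_cols):
    """Hypothetical stand-in for utils2.summary_stats (returns a dict)."""
    stats = {"shape": df.shape, "columns": list(df.columns)}
    for col in id_cols:
        stats[col] = {
            # nunique() ignores missing values by default
            "unique": int(df[col].nunique()),
            # rows that repeat an earlier value in the same column
            "duplicates": int(df[col].duplicated().sum()),
        }
    stats["missing"] = df.isna().sum().to_dict()
    return stats


# Toy marker-follower table (made-up values)
toy = pd.DataFrame({"marker_id": [1, 1, 1, 2], "follower_id": [10, 11, 10, 12]})
toy_stats = summarize(toy, ["follower_id", "marker_id"])
print(toy_stats)
```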


In [34]:
# Create a dictionary to store the original number of rows for each DataFrame
original_num_rows = {
    'markers_followers': 126345412,
    'followers_bios': 70666646
}

In [10]:
duplicates = followers_bios[followers_bios.duplicated('follower_id', keep=False)]
duplicates_sorted = duplicates.sort_values('follower_id')
print(duplicates_sorted)

           follower_id            id      screen_name  \
64631398  7.004164e+17  7.004164e+17       sarr_couse   
42301263  7.004164e+17  7.004164e+17    CoquitoPapi15   
21887413  7.004244e+17  7.004244e+17  Lina_Tamer_Hass   
13503464  7.004244e+17  7.004244e+17  everybdy_lovesb   
37983810  7.006543e+17  7.006543e+17      alexswiftzn   
...                ...           ...              ...   
64262846  1.636459e+18  1.636459e+18   TbcradioT26934   
21013376  1.637583e+18  1.637583e+18       dicorato10   
40323003  1.637583e+18  1.637583e+18    AlMakhtarTall   
39641326  1.638002e+18  1.638002e+18    izzyhomeloans   
44041367  1.638002e+18  1.638002e+18       trixiepots   

                                                description  timestamp_utc  \
64631398                                                NaN   1.455827e+09   
42301263        Chicago, shoes and Views. 🇵🇷  IG:@dripbrady   1.455827e+09   
21887413                                                NaN   1.455829e+09   
[output truncated]

In [31]:
# Row count before de-duplication (stored under its own name so that the
# original_num_rows dictionary defined earlier is not overwritten)
rows_before = len(followers_bios)

# Remove duplicated follower_ids in followers_bios, keeping the first occurrence
followers_bios.drop_duplicates(subset='follower_id', keep='first', inplace=True)

# Calculate the new number of rows
new_num_rows = len(followers_bios)

# Calculate and print the number of rows removed
num_rows_removed = rows_before - new_num_rows
print(f"Number of rows removed: {num_rows_removed}")

# Print the number of rows left
print(f"Number of rows left: {new_num_rows}")

# Calculate and print the percentage of rows left
percentage_left = (new_num_rows / rows_before) * 100
print(f"Percentage of rows left: {percentage_left:.2f}%")

Number of rows removed: 0
Number of rows left: 70666351
Percentage of rows left: 100.00%
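The output reports 0 rows removed, although the row count has dropped from 70666646 to 70666351. A likely explanation is that the cell was executed a second time after the in-place drop had already been applied, so the "before" count was taken from the de-duplicated frame. A variant that returns a new frame and leaves its input untouched gives the same count on every run. This is a sketch on toy data, and `drop_duplicate_ids` is a hypothetical name.

```python
import pandas as pd


def drop_duplicate_ids(df, id_col):
    """Return a de-duplicated copy and the number of rows removed."""
    deduped = df.drop_duplicates(subset=id_col, keep="first")
    return deduped, len(df) - len(deduped)


# Toy bios table with one repeated follower_id (made-up values)
bios = pd.DataFrame(
    {"follower_id": [1, 2, 2, 3], "screen_name": ["a", "b", "b2", "c"]}
)

bios_dedup, n_removed = drop_duplicate_ids(bios, "follower_id")

# Running it again on the untouched input reports the same number
_, n_removed_again = drop_duplicate_ids(bios, "follower_id")

print(n_removed, n_removed_again, len(bios_dedup))
```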


## 2. Filter the marker-follower df

- Filter the marker-follower df:
    - Create a dictionary of brand counts per follower
    - Remove users that follow fewer than 5 brands
    - Continuously track the number of users removed
    - Match the follower IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only include users that follow at least five brands. Subsequent filters then run on the correct users (up-to-date follower bios).

Create a dictionary of brand counts per follower

Remove users that follow fewer than 5 brands

In [35]:
importlib.reload(utils2)
n = 5  # minimal number of brands followed required to be included in the analysis
markers_followers_5 = utils2.filter_followers(markers_followers, 'follower_id', n)

66606237 followers follow less than 5 brands (94.30% of the total followers).
After removing these followers, 4029765 followers are left (5.70% of the total followers).
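The two reported numbers add up to the 70636002 unique `follower_id` values found earlier (66606237 + 4029765). Since `utils2.filter_followers` is not shown, here is a minimal sketch of the counting step on toy data. `filter_by_brand_count` is a hypothetical stand-in, and it assumes each (marker, follower) pair occurs only once, so that rows per follower equal brands per follower.

```python
from collections import Counter


def filter_by_brand_count(pairs, n):
    """Keep (marker_id, follower_id) pairs whose follower follows at least n brands."""
    # Dictionary of brand counts per follower (assumes unique pairs)
    counts = Counter(follower for _, follower in pairs)
    kept = [(marker, follower) for marker, follower in pairs if counts[follower] >= n]
    n_removed = sum(1 for c in counts.values() if c < n)
    return kept, counts, n_removed


# Toy data: u1 follows three brands, u2 one, u3 two
pairs = [(1, "u1"), (2, "u1"), (3, "u1"), (1, "u2"), (2, "u3"), (3, "u3")]
kept, counts, n_removed = filter_by_brand_count(pairs, n=3)

n_total = len(counts)
print(f"{n_removed} of {n_total} followers follow fewer than 3 brands")
print(f"{n_total - n_removed} followers are left")
```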


Match the IDs in the filtered marker-follower df with the follower-bio df, so that the follower bios only cover users who follow at least 5 brands

In [47]:
importlib.reload(utils2)

markers_followers_5.loc[:, 'follower_id'] = markers_followers_5['follower_id'].astype(float)
follower_bios_5 = utils2.streamline_ids(followers_bios, 'follower_id', markers_followers_5, 'follower_id')

Sanity check failed: The number of unique values in the source column (4029765) and target column (3904958) are not identical.
Removed 66761393 rows.
3904958 rows are left.
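The sanity check reports 4029765 unique IDs on one side and 3904958 on the other, a gap of 124807. This suggests that some followers who pass the brand filter have no row in the bios. Set differences are a quick way to find out which IDs are unmatched in each direction. The sketch below uses made-up IDs, and `match_ids` is a hypothetical helper, not the implementation of `utils2.streamline_ids`.

```python
def match_ids(source_ids, target_ids):
    """Split two ID collections into matched and unmatched sets."""
    source, target = set(source_ids), set(target_ids)
    return {
        "matched": source & target,
        "only_in_source": source - target,
        "only_in_target": target - source,
    }


# Toy IDs: 103 passed the brand filter but has no bio row
filtered_follower_ids = [101, 102, 103, 104]
bio_ids = [101, 102, 104, 105, 106]

result = match_ids(filtered_follower_ids, bio_ids)
print({key: sorted(ids) for key, ids in result.items()})
```

Inspecting a sample of the unmatched IDs should show whether they are genuinely absent from the bios file or whether they were lost earlier, for example during de-duplication.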
