# Processing of Twitter Account Data
In 2022, I made a [Naive Bayes model that detected Twitter bots](https://github.com/karl-aldous-banaag/naive-bayes-twitter-bot-detection/). I revisited my old project and redid the data processing.

For this project, I used the output from the 2022 data collection, but I processed the data differently. I prioritized shaping the data around how the accounts behave.

## Importing Dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("twitter_output.csv")

In [3]:
df.head()

Unnamed: 0,id,account_type,created_at,default_profile,default_profile_image,description,favourites_count,followers_count,geo_enabled,lang,location,profile_background_image,profile_image,screen_name,statuses_count,verified,average_tweets_per_day,account_age
0,787405734442958848,bot,Sat Oct 15 21:32:11 +0000 2016,False,False,"Blame @MoistHorse, Inspired by @MakingInvisibl...",5.0,1995.0,False,,,http://abs.twimg.com/images/themes/theme1/bg.png,http://pbs.twimg.com/profile_images/7874121826...,best_in_dumbest,17209.0,False,,
1,796216118331310080,human,Wed Nov 09 05:01:30 +0000 2016,False,False,Photographing the American West since 1980. I ...,451.0,781.0,False,,United States,http://abs.twimg.com/images/themes/theme1/bg.png,http://pbs.twimg.com/profile_images/8023296328...,CJRubinPhoto,251.0,False,,
2,875949740503859204,human,Sat Jun 17 05:34:27 +0000 2017,False,False,🔥𝖙𝖍𝖊 𝖘𝖆𝖛𝖆𝖌𝖊 𝖌𝖊𝖓𝖙𝖑𝖊𝖒𝖆𝖓🔥 ...,7426.0,261.0,True,,,http://abs.twimg.com/images/themes/theme1/bg.png,http://pbs.twimg.com/profile_images/1579338210...,SVGEGENT,1499.0,False,,
3,756119643622735875,human,Thu Jul 21 13:32:25 +0000 2016,True,False,Wife.Godmother.Friend.Feline Fanatic!,19058.0,751.0,True,,"Birmingham, AL",,http://pbs.twimg.com/profile_images/1284884924...,TinkTinkEDU,2531.0,False,,
4,464781334,human,Sun Jan 15 16:32:35 +0000 2012,False,False,England U21 Assistant Coach | @pumafootball Am...,616.0,736975.0,True,,"England, United Kingdom",http://abs.twimg.com/images/themes/theme1/bg.png,http://pbs.twimg.com/profile_images/1437799096...,JoleonLescott,4750.0,True,,


## Removal of Unnecessary Columns
For this dataset, I will remove `default_profile`, `default_profile_image`, `description`, `geo_enabled`, `lang`, `location`, `profile_background_image`, `profile_image`, `average_tweets_per_day`, `account_age`, and `created_at`. Below are the reasons why these columns will be deleted.

#### default_profile
This column indicates whether the account still uses the default profile theme. Bot accounts are likely to customize their profiles to imitate human accounts. Thus, the profile theme is not relevant to the behavior displayed by the account.

#### default_profile_image
Bot accounts are likely to change their profile pictures to imitate human accounts. Thus, the profile picture is not relevant to the behavior displayed by the account.

#### description
The description does not have an effect on the behavior of the account.

#### geo_enabled
This column does not provide any information about the behavior of the account.

#### lang
The language of the Twitter account does not provide any insights about the nature of the Twitter account.

#### location
Many values in this column are blank or do not provide any information about the location of the account.

#### profile_background_image
The presence or absence of a background image does not affect the behavior displayed in the account.

#### average_tweets_per_day
This column contains no data, but its values will be computed later by dividing `statuses_count` by the age of the account in days. To get that age, I will count the days between the date in `created_at` and October 25, 2022 (the day when this dataset was created).
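
As a worked example of that calculation, the sketch below uses only the standard library and the first row of the dataset shown earlier (`created_at` of `Sat Oct 15 21:32:11 +0000 2016`, `statuses_count` of 17209). The format string and variable names are my own assumptions for illustration; the notebook itself does this step with pandas.

```python
from datetime import datetime, timezone

# Assumed format string for the timestamp layout seen in `created_at`.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Cutoff used in this notebook: the day the dataset was created.
CUTOFF = datetime(2022, 10, 25, tzinfo=timezone.utc)

created_at = datetime.strptime("Sat Oct 15 21:32:11 +0000 2016", TWITTER_FORMAT)
statuses_count = 17209

# Whole days between account creation and the cutoff (partial days are dropped).
account_age_days = (CUTOFF - created_at).days

average_tweets_per_day = statuses_count / account_age_days

print(account_age_days)                  # 2200
print(round(average_tweets_per_day, 6))  # 7.822273
```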

#### account_age
This column contains no data, but the values of this column will be computed later.

#### created_at
Bots can be old or new. Regardless, this column will not be needed once `account_age` is computed. However, it will stay at the beginning of this notebook so that I can derive `average_tweets_per_day` and `account_age`.

In [4]:
df.drop(
    labels = [
        "default_profile", "default_profile_image", "description", "geo_enabled",
        "lang", "location", "profile_background_image", "profile_image",
        "average_tweets_per_day", "account_age"
    ],
    axis = "columns"
).head()

Unnamed: 0,id,account_type,created_at,favourites_count,followers_count,screen_name,statuses_count,verified
0,787405734442958848,bot,Sat Oct 15 21:32:11 +0000 2016,5.0,1995.0,best_in_dumbest,17209.0,False
1,796216118331310080,human,Wed Nov 09 05:01:30 +0000 2016,451.0,781.0,CJRubinPhoto,251.0,False
2,875949740503859204,human,Sat Jun 17 05:34:27 +0000 2017,7426.0,261.0,SVGEGENT,1499.0,False
3,756119643622735875,human,Thu Jul 21 13:32:25 +0000 2016,19058.0,751.0,TinkTinkEDU,2531.0,False
4,464781334,human,Sun Jan 15 16:32:35 +0000 2012,616.0,736975.0,JoleonLescott,4750.0,True


The code above dropped the unnecessary columns, so it is safe to set `df` to
`df.drop(labels = ["default_profile", "default_profile_image", "description", "geo_enabled", "lang", "location", "profile_background_image", "profile_image", "average_tweets_per_day", "account_age"], axis = "columns")`.

In [5]:
df = df.drop(
    labels = [
        "default_profile", "default_profile_image", "description", "geo_enabled",
        "lang", "location", "profile_background_image", "profile_image",
        "average_tweets_per_day", "account_age"
    ],
    axis = "columns"
)

## Modifying Dataset

### Converting `created_at` to datetime

In [6]:
df.dtypes

id                    int64
account_type         object
created_at           object
favourites_count    float64
followers_count     float64
screen_name          object
statuses_count      float64
verified             object
dtype: object

`created_at` has a type of object, so I will convert it to datetime using the `pd.to_datetime` function. 

In [7]:
df["created_at"] = pd.to_datetime(df["created_at"])

The code below verifies that the column was converted successfully.

In [8]:
df.dtypes

id                                int64
account_type                     object
created_at          datetime64[ns, UTC]
favourites_count                float64
followers_count                 float64
screen_name                      object
statuses_count                  float64
verified                         object
dtype: object
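
`pd.to_datetime` inferred the timestamp layout on its own here. If inference ever turns out to be slow or ambiguous, one option is to pass the layout explicitly. The sketch below runs on a two-row stand-in Series and not on the real dataset, and the format string is my assumption about the layout seen in `created_at`.

```python
import pandas as pd

# Stand-in for df["created_at"]: one value copied from the table above, one missing.
raw = pd.Series(["Sat Oct 15 21:32:11 +0000 2016", None])

# Assumed format string; utc=True keeps the result timezone-aware in UTC.
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y", utc=True)

print(parsed.dt.tz)            # UTC
print(parsed.isna().tolist())  # [False, True]
```

Missing entries come back as `NaT`, so they can still be handled together with the other missing values later.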

### Dropping Rows with Missing Values

Before I remove rows with missing values, I first want to check how many rows I have at the moment.

In [9]:
len(df)

37438

After checking the number of rows, I want to see how many rows I will lose when I remove the rows with missing values.

In [10]:
pd.DataFrame(df.isnull().sum())

Unnamed: 0,0
id,0
account_type,0
created_at,1788
favourites_count,1788
followers_count,1788
screen_name,1788
statuses_count,1788
verified,1788


I will only lose **1788** rows. The missing values in `description` do not matter here, since Twitter accounts can exist without descriptions and that column was already dropped. So, I decided to remove the rows with missing values.

In [11]:
df = df.dropna(subset=["created_at", "favourites_count", "followers_count", "screen_name", "statuses_count", "verified"])
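
Dropping rows costs exactly 1788 rows only if the missing values sit in the same rows across all six columns, which the identical counts suggest but do not prove. Below is a minimal sketch of how that could be checked, using a small made-up frame in place of `df` (the column subset and values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy stand-in for df: the second row is missing everything except id and account_type.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "account_type": ["bot", "human", "human"],
    "followers_count": [1995.0, None, 261.0],
    "screen_name": ["a", None, "c"],
})
cols = ["followers_count", "screen_name"]

any_missing = toy[cols].isnull().any(axis="columns")  # at least one gap in the row
all_missing = toy[cols].isnull().all(axis="columns")  # every checked column is empty

# If the two counts match, the gaps line up and dropna removes exactly that many rows.
print(any_missing.sum(), all_missing.sum())
print(len(toy), len(toy.dropna(subset=cols)))
```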

### Count Age of Accounts in Days

In [12]:
import datetime
import numpy as np
import math

In [13]:
pd.to_datetime("Oct 25 0:00:00 +0000 2022")

Timestamp('2022-10-25 00:00:00+0000', tz='UTC')

#### Computing

In [14]:
df["account_age"] = ((pd.to_datetime("Oct 25 0:00:00 +0000 2022") - df["created_at"]) / np.timedelta64(1, 'D')).astype(int)

In [15]:
df.head()

Unnamed: 0,id,account_type,created_at,favourites_count,followers_count,screen_name,statuses_count,verified,account_age
0,787405734442958848,bot,2016-10-15 21:32:11+00:00,5.0,1995.0,best_in_dumbest,17209.0,False,2200
1,796216118331310080,human,2016-11-09 05:01:30+00:00,451.0,781.0,CJRubinPhoto,251.0,False,2175
2,875949740503859204,human,2017-06-17 05:34:27+00:00,7426.0,261.0,SVGEGENT,1499.0,False,1955
3,756119643622735875,human,2016-07-21 13:32:25+00:00,19058.0,751.0,TinkTinkEDU,2531.0,False,2286
4,464781334,human,2012-01-15 16:32:35+00:00,616.0,736975.0,JoleonLescott,4750.0,True,3935


### Removing `created_at` Column
Now that I have the `account_age` column, I do not need the `created_at` column any longer.

In [16]:
df = df.drop(labels = ["created_at"], axis = "columns")
df.head()

Unnamed: 0,id,account_type,favourites_count,followers_count,screen_name,statuses_count,verified,account_age
0,787405734442958848,bot,5.0,1995.0,best_in_dumbest,17209.0,False,2200
1,796216118331310080,human,451.0,781.0,CJRubinPhoto,251.0,False,2175
2,875949740503859204,human,7426.0,261.0,SVGEGENT,1499.0,False,1955
3,756119643622735875,human,19058.0,751.0,TinkTinkEDU,2531.0,False,2286
4,464781334,human,616.0,736975.0,JoleonLescott,4750.0,True,3935


### Computing the Average Number of Posts per Day

In [17]:
df["average_tweets_per_day"] = df["statuses_count"] / df["account_age"]
df.head()

Unnamed: 0,id,account_type,favourites_count,followers_count,screen_name,statuses_count,verified,account_age,average_tweets_per_day
0,787405734442958848,bot,5.0,1995.0,best_in_dumbest,17209.0,False,2200,7.822273
1,796216118331310080,human,451.0,781.0,CJRubinPhoto,251.0,False,2175,0.115402
2,875949740503859204,human,7426.0,261.0,SVGEGENT,1499.0,False,1955,0.766752
3,756119643622735875,human,19058.0,751.0,TinkTinkEDU,2531.0,False,2286,1.107174
4,464781334,human,616.0,736975.0,JoleonLescott,4750.0,True,3935,1.207116
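
One edge case worth keeping in mind: an account created less than a day before the cutoff would have an `account_age` of 0, and the division would produce `inf` (or `NaN` for an account with 0 tweets). Nothing shown in this notebook indicates that the dataset contains such accounts, so the following is only a defensive sketch on made-up values. Flooring the age at one day is my assumption about a reasonable fallback, not something the original analysis does.

```python
import pandas as pd

toy = pd.DataFrame({
    "statuses_count": [17209.0, 12.0],
    "account_age": [2200, 0],  # the second account is less than a day old
})

# Treat accounts younger than one day as one day old to avoid dividing by zero.
safe_age = toy["account_age"].clip(lower=1)
toy["average_tweets_per_day"] = toy["statuses_count"] / safe_age

print(toy["average_tweets_per_day"].round(6).tolist())  # [7.822273, 12.0]
```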


If the `statuses_count` and `average_tweets_per_day` columns were to coexist while making the model, the number of tweets would be considered in two columns of the dataset. So, I must remove the `statuses_count` column.

In [18]:
df = df.drop(labels = ["statuses_count"], axis = "columns")
df.head()

Unnamed: 0,id,account_type,favourites_count,followers_count,screen_name,verified,account_age,average_tweets_per_day
0,787405734442958848,bot,5.0,1995.0,best_in_dumbest,False,2200,7.822273
1,796216118331310080,human,451.0,781.0,CJRubinPhoto,False,2175,0.115402
2,875949740503859204,human,7426.0,261.0,SVGEGENT,False,1955,0.766752
3,756119643622735875,human,19058.0,751.0,TinkTinkEDU,False,2286,1.107174
4,464781334,human,616.0,736975.0,JoleonLescott,True,3935,1.207116


### Computing the Average Number of Tweets Liked per Day

In [19]:
df["average_tweets_liked_per_day"] = df["favourites_count"] / df["account_age"]
df.head()

Unnamed: 0,id,account_type,favourites_count,followers_count,screen_name,verified,account_age,average_tweets_per_day,average_tweets_liked_per_day
0,787405734442958848,bot,5.0,1995.0,best_in_dumbest,False,2200,7.822273,0.002273
1,796216118331310080,human,451.0,781.0,CJRubinPhoto,False,2175,0.115402,0.207356
2,875949740503859204,human,7426.0,261.0,SVGEGENT,False,1955,0.766752,3.798465
3,756119643622735875,human,19058.0,751.0,TinkTinkEDU,False,2286,1.107174,8.336833
4,464781334,human,616.0,736975.0,JoleonLescott,True,3935,1.207116,0.156544


If the `favourites_count` and `average_tweets_liked_per_day` columns were to coexist while making the model, the number of liked tweets would be considered in two columns. So, I must remove the `favourites_count` column.

In [20]:
df = df.drop(labels = ["favourites_count"], axis = "columns")
df.head()

Unnamed: 0,id,account_type,followers_count,screen_name,verified,account_age,average_tweets_per_day,average_tweets_liked_per_day
0,787405734442958848,bot,1995.0,best_in_dumbest,False,2200,7.822273,0.002273
1,796216118331310080,human,781.0,CJRubinPhoto,False,2175,0.115402,0.207356
2,875949740503859204,human,261.0,SVGEGENT,False,1955,0.766752,3.798465
3,756119643622735875,human,751.0,TinkTinkEDU,False,2286,1.107174,8.336833
4,464781334,human,736975.0,JoleonLescott,True,3935,1.207116,0.156544


## Write Clean Data as .csv File

In [21]:
df.to_csv("twitter_output_clean.csv", index = False)