# Clean Data

* Python script that performs the initial processing of the raw tweet data
* Remove non-English tweets
* Remove duplicate tweets

## Install required libraries

In [1]:
!pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)

## Import required libraries

In [2]:
import numpy as np
import pandas as pd
import nltk
from google.colab import drive
from langdetect import detect

## Read dataset

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/nlp project/Dataset/biden_tweets_raw.csv')

df

Unnamed: 0.1,Unnamed: 0,Datetime,Tweet Id,Text,Username
0,0,2020-12-03 23:59:59+00:00,1334648608638054400,@JoeBiden iste bu be sonuna kadar joe biden,EmirhanCil7
1,1,2020-12-03 23:59:58+00:00,1334648604200415232,"Yet you are uncertain why 23,000+ votes (98% f...",MLKing2
2,2,2020-12-03 23:59:57+00:00,1334648599477690380,Right. Sure. You bet. Joe Biden is the Preside...,gfounder1
3,3,2020-12-03 23:59:57+00:00,1334648598475108357,@AlexisJones1969 @theangiestanton I guess you ...,_Quetzy_
4,4,2020-12-03 23:59:56+00:00,1334648597825146882,Great job team. Biden is just reinstalling al...,sphincter987
...,...,...,...,...,...
19997,9996,2020-12-04 22:57:40+00:00,1334995312637775872,Alito Responds To Appeal Asking To Block Biden...,VeraldoF4F
19998,9997,2020-12-04 22:57:40+00:00,1334995312335699969,@realDonaldTrump What a freeeeakeeennn Joke yo...,bfrando
19999,9998,2020-12-04 22:57:39+00:00,1334995311350177795,@realDonaldTrump Thanks Biden! https://t.co/th...,BlueLanternUSA
20000,9999,2020-12-04 22:57:39+00:00,1334995310226116608,Urge President-elect Biden to undo the damage ...,RickeyButtery


In [5]:
# extract the tweet text column as a NumPy array
text_data_raw = df['Text'].values

## Remove non-English tweets

In [6]:
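# detect the language of each tweet and keep only the English ones
languages = []  # detected language code of every tweet that could be classified
text_data = []  # English tweets only
en_count = 0
for text in text_data_raw:
  if text != '':
    try:
      language = detect(text)
    except Exception:
      # detect() fails on text with no usable features (e.g. only URLs or emoji)
      # and on non-string values such as NaN; skip those tweets
      continue
    languages.append(language)
    if language == 'en':
      en_count += 1
      text_data.append(text)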
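
The loop above can also be packaged as a reusable function. The following is a minimal sketch, not the notebook's own code: `filter_english` and its `detect_fn` parameter are hypothetical names, and the detector is passed in as an argument so the logic can be tried out without installing `langdetect`.

```python
from collections import Counter


def filter_english(texts, detect_fn, target="en"):
    """Return (kept_texts, language_counts) for an iterable of texts.

    detect_fn is any callable that maps a string to a language code
    (hypothetical parameter; in this notebook it would be langdetect.detect).
    """
    kept = []
    counts = Counter()
    for text in texts:
        # Skip missing values (e.g. NaN floats from pandas) and blank strings.
        if not isinstance(text, str) or not text.strip():
            continue
        try:
            language = detect_fn(text)
        except Exception:
            # The detector could not classify this text, so drop it.
            continue
        counts[language] += 1
        if language == target:
            kept.append(text)
    return kept, counts
```

In this notebook the call would be `text_data, language_counts = filter_english(text_data_raw, detect)`. The langdetect README notes that detection is non-deterministic by default and that setting `DetectorFactory.seed = 0` (imported from `langdetect`) before the first call makes results repeatable.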

In [7]:
np.unique(languages, return_counts = True)

(array(['af', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'fi',
        'fr', 'hr', 'hu', 'id', 'it', 'ja', 'ko', 'lt', 'nl', 'no', 'pl',
        'pt', 'ro', 'ru', 'sl', 'so', 'sq', 'sv', 'sw', 'tl', 'tr', 'vi',
        'zh-cn'], dtype='<U5'),
 array([   36,    38,     2,     6,    68,   213,     1, 18021,   689,
            8,    11,   142,     1,     3,    47,    42,     4,     1,
            1,   238,    37,     9,   123,   118,     2,     7,     8,
            1,    31,     1,     7,    70,    12,     2]))
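
The two parallel arrays returned by `np.unique` are hard to scan. A small sketch of pairing each code with its count and ranking them is shown here; it uses plain lists holding the four most frequent codes copied from the output above, and the variable names are illustrative only.

```python
# Subset of the np.unique output above: the four most frequent language codes.
codes = ["de", "en", "es", "nl"]
counts = [213, 18021, 689, 238]

# Pair each code with its count and sort by count, largest first.
ranked = sorted(zip(codes, counts), key=lambda pair: pair[1], reverse=True)

for code, count in ranked:
    print(f"{code}: {count}")
```

With the real arrays, the same pattern would be `codes, counts = np.unique(languages, return_counts=True)` followed by the `sorted(zip(...))` line.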

In [8]:
print("Number of English texts: " + str(en_count))

Number of English texts: 18021


In [9]:
text_data

['Yet you are uncertain why 23,000+ votes (98% for Biden) were logged at that same location shortly after midnight?\n\n@RealJamesWoods @RealRLimbaugh @RealBasedMAGA',
 'Right. Sure. You bet. Joe Biden is the President and you’re just going to have to accept it. Sound familiar?',
 "@AlexisJones1969 @theangiestanton I guess you don't even half a quarter of one if you think Biden isn't your new president.",
 'Great job team.  Biden is just reinstalling all the people the failed in the BO admin. \n\nFilling the swamp back in.',
 '@bookmaker_eu Will I get paid out on my Biden to win the election bet before 2021?\n\nYes +300\nNo -200',
 '@larryelder Little known fact – that Biden’s support for Thatcher in 1982 was not the first time that the US helped kick Argie ass? Lexington Raid 1831 anyone? \nhttps://t.co/gWrMmEJWJq\nOr p.101  👇\nhttps://t.co/v9LH3geguu',
 '@JoeBiden I speak for all Americans when I say Joe Biden is our President, Trump should have never won 2016, and I can’t wait to fli

In [10]:
# store English tweets as new df
clean_df = pd.DataFrame(text_data, columns = ['Text'])

clean_df

Unnamed: 0,Text
0,"Yet you are uncertain why 23,000+ votes (98% f..."
1,Right. Sure. You bet. Joe Biden is the Preside...
2,@AlexisJones1969 @theangiestanton I guess you ...
3,Great job team. Biden is just reinstalling al...
4,@bookmaker_eu Will I get paid out on my Biden ...
...,...
18016,Alito Responds To Appeal Asking To Block Biden...
18017,@realDonaldTrump What a freeeeakeeennn Joke yo...
18018,@realDonaldTrump Thanks Biden! https://t.co/th...
18019,Urge President-elect Biden to undo the damage ...


## Remove duplicate tweets

In [11]:
clean_df = clean_df.drop_duplicates(keep='first')

clean_df

Unnamed: 0,Text
0,"Yet you are uncertain why 23,000+ votes (98% f..."
1,Right. Sure. You bet. Joe Biden is the Preside...
2,@AlexisJones1969 @theangiestanton I guess you ...
3,Great job team. Biden is just reinstalling al...
4,@bookmaker_eu Will I get paid out on my Biden ...
...,...
18016,Alito Responds To Appeal Asking To Block Biden...
18017,@realDonaldTrump What a freeeeakeeennn Joke yo...
18018,@realDonaldTrump Thanks Biden! https://t.co/th...
18019,Urge President-elect Biden to undo the damage ...
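
`drop_duplicates` compares whole rows, so only tweets whose text matches exactly are removed. Tweets that differ only in letter case, spacing, or a shortened link are kept as separate rows. Whether such near-duplicates should be merged is a judgment call this notebook does not make; the sketch below shows one possible approach, with hypothetical helper names (`normalize`, `drop_near_duplicates`) and a normalization rule chosen purely for illustration.

```python
import re

# Matches http/https links such as the t.co URLs Twitter appends to tweets.
URL_PATTERN = re.compile(r"https?://\S+")


def normalize(text):
    """Build a comparison key: drop URLs, lowercase, collapse whitespace."""
    text = URL_PATTERN.sub("", text)
    return " ".join(text.lower().split())


def drop_near_duplicates(texts):
    """Keep the first tweet for each normalized key, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

With pandas, the same idea could be expressed as `clean_df[~clean_df['Text'].map(normalize).duplicated()]`, which keeps the original text and uses the normalized key only for comparison.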


## Export cleaned data

In [12]:
# shuffle dataframe rows
clean_df_shuffled = clean_df.sample(frac=1, random_state=42).reset_index(drop=True)

clean_df_shuffled

Unnamed: 0,Text
0,Biden talks about his resignation plans 🙄
1,@FoxNews We all know how Gov work. Biden said ...
2,@DonnaFEdwards @JoeBiden @neeratanden @BrianCD...
3,@realDonaldTrump Joe Biden did it again. Great...
4,"@JenGriffinFNC AP News may have said so, but n..."
...,...
17632,"@AkonFenty @willchamberlain Yeah, thats why ov..."
17633,"GAME OVER, BIDEN! EVIDENCIAS EM VÍDEO. FORAM P..."
17634,@HarleyQ11341281 @RudyGiuliani Joe 'Scrambled ...
17635,@c5hardtop1999 #FBI Dir. Wray Profited from Hu...


In [13]:
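clean_file_name = 'biden_tweets_clean.csv'

# index=False keeps the row index out of the file, so reloading the CSV
# does not produce an extra 'Unnamed: 0' column
clean_df_shuffled.to_csv(clean_file_name, index=False)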