# Cleaning and Organizing Data Extracted from Twitter with Octoparse

In [68]:
import pandas as pd

# Getting the datasets

The CSV file below was acquired by web scraping with Octoparse. \
Octoparse: https://www.octoparse.com/ \
PET-MA tutorial on scraping tweets from Twitter with Octoparse:

In [69]:
df_tweets = pd.read_csv("/content/drive/MyDrive/DI - Analise Bitcoin - GPU/4. Execução/0. Dados Coletados/all_tweets.csv", names=["Field1", "Text", "Text1", "Text2"] )
df_tweets

| | Field1 | Text | Text1 | Text2 |
|---|---|---|---|---|
| 0 | Field1 | Text | Text1 | Text2 |
| 1 | PlanB\n@100trillionUSD\n·\n3 de out\nDo you th... | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 3 de out |
| 2 | PlanB\n@100trillionUSD\n·\n11 de set\nDo you t... | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 11 de set |
| 3 | PlanB\n@100trillionUSD\n·\n7 de set\n43k\n979\... | 43k | @100trillionUSD | 7 de set |
| 4 | PlanB\n@100trillionUSD\n·\n15 de ago\nDo you t... | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 15 de ago |
| ... | ... | ... | ... | ... |
| 67159 | Willy Woo\n@woonomic\n·\n30 de nov de 2012\nAn... | And it's started... #tedxtearo nice work guys! | @woonomic | 30 de nov de 2012 |
| 67160 | Willy Woo\n@woonomic\n·\n7 de mar de 2011\nOn ... | On knowledge: Circa 1400 a book cost you the p... | @woonomic | 7 de mar de 2011 |
| 67161 | Willy Woo\n@woonomic\n·\n9 de out de 2009\nOpe... | Opened up my new MacBook Pro 13 Unibody. Wow, ... | @woonomic | 9 de out de 2009 |
| 67162 | Willy Woo\n@woonomic\n·\n29 de jul de 2009\nGM... | GMail Tip: just discovered you can drag label... | @woonomic | 29 de jul de 2009 |


# Adjusting the dataframe

## Removing unnecessary rows and columns

In [70]:
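# Field1 holds the raw scraped block, which the other columns already split out
df_tweets.drop(columns="Field1", inplace=True)
# Row 0 is the CSV's own header line, which was read in as data
df_tweets.drop(index=0, inplace=True)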
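Row 0 has to be dropped because the file's own header line was read as data: passing `names=` without `header=0` tells `read_csv` that the file has no header. The sketch below shows the difference on a small inline CSV. The sample rows are hypothetical stand-ins, not the scraped data.

```python
import io
import pandas as pd

# Hypothetical three-line stand-in for the scraped file (header + two rows)
csv_text = (
    "Field1,Text,Text1,Text2\n"
    "raw block,hello,@user,3 de out\n"
    "raw block,world,@user,11 de set\n"
)
cols = ["Field1", "Text", "Text1", "Text2"]

# names= alone: the header line is kept as data row 0
with_header_row = pd.read_csv(io.StringIO(csv_text), names=cols)

# header=0 together with names=: the file's header line is replaced, not kept
without_header_row = pd.read_csv(io.StringIO(csv_text), header=0, names=cols)

print(len(with_header_row), len(without_header_row))
```

With `header=0` the later `drop(index=0)` step would no longer be needed, and the index would start at 0 instead of 1.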

## Renaming the columns

In [71]:
new_headers = {
    # current name : new name
    'Text': 'Tweet',
    'Text1': 'User',
    'Text2': 'Date'
}

df_tweets.rename(columns=new_headers, inplace=True)
df_tweets

| | Tweet | User | Date |
|---|---|---|---|
| 1 | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 3 de out |
| 2 | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 11 de set |
| 3 | 43k | @100trillionUSD | 7 de set |
| 4 | Do you think #bitcoin will reach $500K, $288K ... | @100trillionUSD | 15 de ago |
| 5 | So was $64K the top for this cycle (halving 20... | @100trillionUSD | 1 de jul |
| ... | ... | ... | ... |
| 67159 | And it's started... #tedxtearo nice work guys! | @woonomic | 30 de nov de 2012 |
| 67160 | On knowledge: Circa 1400 a book cost you the p... | @woonomic | 7 de mar de 2011 |
| 67161 | Opened up my new MacBook Pro 13 Unibody. Wow, ... | @woonomic | 9 de out de 2009 |
| 67162 | GMail Tip: just discovered you can drag label... | @woonomic | 29 de jul de 2009 |


# Changing column data type

In [72]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67163 entries, 1 to 67163
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Tweet   66921 non-null  object
 1   User    67146 non-null  object
 2   Date    67132 non-null  object
dtypes: object(3)
memory usage: 2.0+ MB
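The Non-Null counts above imply missing values: out of 67163 rows, 242 have no Tweet, 17 have no User, and 31 have no Date. A minimal sketch of how such gaps can be counted and inspected directly, using a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame with the same column names; None marks a missing cell
toy = pd.DataFrame({
    "Tweet": ["43k", None, "gm"],
    "User": ["@a", "@b", None],
    "Date": ["3 de out", None, None],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Rows that have at least one missing cell
rows_with_gap = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gap)
```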


The Tweet and User columns will be converted from object to string. As the outputs below show, pandas still reports the dtype as object after `astype(str)`.

In [73]:
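# astype returns a new Series, so assign it back to keep the conversion
df_tweets['Tweet'] = df_tweets['Tweet'].astype(str)
df_tweets['Tweet']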

1        Do you think #bitcoin will reach $500K, $288K ...
2        Do you think #bitcoin will reach $500K, $288K ...
3                                                      43k
4        Do you think #bitcoin will reach $500K, $288K ...
5        So was $64K the top for this cycle (halving 20...
                               ...                        
67159       And it's started... #tedxtearo nice work guys!
67160    On knowledge: Circa 1400 a book cost you the p...
67161    Opened up my new MacBook Pro 13 Unibody. Wow, ...
67162    GMail Tip:  just discovered you can drag label...
67163    Giving TweetDeck a go... in my camper...  Out ...
Name: Tweet, Length: 67163, dtype: object

In [74]:
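# astype returns a new Series, so assign it back to keep the conversion
df_tweets['User'] = df_tweets['User'].astype(str)
df_tweets['User']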

1        @100trillionUSD
2        @100trillionUSD
3        @100trillionUSD
4        @100trillionUSD
5        @100trillionUSD
              ...       
67159          @woonomic
67160          @woonomic
67161          @woonomic
67162          @woonomic
67163          @woonomic
Name: User, Length: 67163, dtype: object
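Two details of `astype` are easy to miss: it returns a new Series instead of modifying the column, and the plain `str` target leaves the reported dtype as object. The sketch below illustrates both on a hypothetical two-row frame, using pandas' dedicated `"string"` dtype as an alternative. It assumes pandas 1.0 or later, and it is not the conversion the notebook itself performs.

```python
import pandas as pd

# Hypothetical two-row frame; None stands in for a missing tweet
toy = pd.DataFrame({
    "Tweet": ["43k", None],
    "User": ["@100trillionUSD", "@woonomic"],
})

# The result is discarded here, so the column in toy is unchanged
toy["Tweet"].astype("string")
dtype_before = str(toy["Tweet"].dtype)

# Assigning the result back makes the conversion persist
toy["Tweet"] = toy["Tweet"].astype("string")
dtype_after = str(toy["Tweet"].dtype)

# The "string" dtype keeps missing values marked as missing
missing_kept = toy["Tweet"].isna().tolist()

print(dtype_before, dtype_after, missing_kept)
```

The `"string"` dtype is worth considering here because the Tweet and User columns contain missing values, and it keeps them distinguishable from real text.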

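Before changing the data type of the Date column, some incorrect values must be fixed. For example, the output below shows that recent dates such as `3 de out` carry no year, while older ones such as `30 de nov de 2012` do.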

In [79]:
df_tweets['Date']

1                  3 de out
2                 11 de set
3                  7 de set
4                 15 de ago
5                  1 de jul
                ...        
67159     30 de nov de 2012
67160      7 de mar de 2011
67161      9 de out de 2009
67162     29 de jul de 2009
67163    24 de mar de 2009
Name: Date, Length: 67163, dtype: object
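The dates above use Portuguese month abbreviations, and the most recent ones omit the year. One possible way to turn them into timestamps is sketched below. The helper `parse_pt_date` and its `assumed_year` parameter are hypothetical names introduced for illustration, and the value 2021 is only a placeholder, since the notebook does not state the year in which the data was scraped.

```python
import re
import pandas as pd

# Portuguese three-letter month abbreviations
PT_MONTHS = {
    "jan": 1, "fev": 2, "mar": 3, "abr": 4, "mai": 5, "jun": 6,
    "jul": 7, "ago": 8, "set": 9, "out": 10, "nov": 11, "dez": 12,
}

# Matches "3 de out" and "30 de nov de 2012"; the year part is optional
DATE_PATTERN = re.compile(r"^(\d{1,2}) de ([a-z]{3})(?: de (\d{4}))?$")


def parse_pt_date(text, assumed_year):
    """Parse one scraped date; return pd.NaT when it cannot be parsed."""
    if not isinstance(text, str):
        return pd.NaT  # missing value
    match = DATE_PATTERN.match(text.strip().lower())
    if match is None:
        return pd.NaT  # unexpected format
    day, month_abbr, year = match.groups()
    month = PT_MONTHS.get(month_abbr)
    if month is None:
        return pd.NaT  # unknown month abbreviation
    try:
        return pd.Timestamp(
            year=int(year) if year else assumed_year,
            month=month,
            day=int(day),
        )
    except ValueError:
        return pd.NaT  # impossible calendar date


# Values taken from the output above, plus one missing entry
sample = pd.Series(["3 de out", "30 de nov de 2012", None])
# assumed_year=2021 is a placeholder, not a value taken from the notebook
parsed = sample.apply(lambda t: parse_pt_date(t, assumed_year=2021))
print(parsed)
```

Anything that does not match the two formats becomes `NaT`, so unexpected values can be found afterwards with `parsed.isna()` rather than raising an error partway through the column.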