# Preprocessing Tweet Data

The purpose of this notebook is to normalize and transform dataset columns to improve the quality of the data.

We start by importing the libraries and configuring some settings:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', 150)

First, we load the data from the input CSV file into a DataFrame:

In [None]:
df = pd.read_csv("2_PreprocessingTweets.csv", sep="|", lineterminator='\n', low_memory=False)
df.head()

In [None]:
df.shape

In [None]:
df.columns

The initial dataset has 33,130 records and 21 columns.

We observe that there is an unnecessary column, so we remove it from the DataFrame:

In [None]:
df.drop("Unnamed: 0\r", axis=1, inplace=True)
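Column names such as "Unnamed: 0\r" and "user_lang\r" carry a trailing carriage return, which makes them awkward to reference. As a minimal sketch, assuming a small made-up DataFrame (the sample values are hypothetical and not taken from the dataset), the names could be stripped of surrounding whitespace before dropping the unneeded column:

```python
import pandas as pd

# Hypothetical sample mimicking column names that carry a trailing "\r"
sample = pd.DataFrame(
    [["hello", "en", 0], ["hola", "es", 1]],
    columns=["tweets", "user_lang\r", "Unnamed: 0\r"],
)

# Strip leading/trailing whitespace (including "\r") from every column name
sample.columns = sample.columns.str.strip()

# Drop the unneeded column by its cleaned name
sample = sample.drop(columns=["Unnamed: 0"])
print(list(sample.columns))
```

With this variation, later cells could refer to the columns by their clean names, at the cost of diverging from the names used in the rest of this notebook.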

Next, we convert the date columns to datetime format:

In [None]:
df['created_date'] = pd.to_datetime(df['created_date'])
df['user_created_account'] = pd.to_datetime(df['user_created_account'])
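If some rows might contain malformed dates, `pd.to_datetime` raises an error by default. The sketch below, which uses made-up values rather than the real dataset, shows one possible way to turn unparseable entries into `NaT` and count them; whether the actual columns contain such values is not known from this notebook.

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp and one malformed entry
raw_dates = pd.Series(["2018-03-01 10:15:00", "not a date"])

# errors="coerce" converts unparseable values to NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many entries could not be parsed
n_invalid = int(parsed.isna().sum())
print(parsed.dtype, n_invalid)
```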

The remaining string columns also need some transformations in order to:

- Remove extra spaces
- Remove carriage returns (\r)
- Remove newlines (\n)
- Rename columns

In [None]:
df['tweets'] = df['tweets'].map(lambda x: x.replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['source'] = df['source'].map(lambda x: x.replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['language'] = df['language'].map(lambda x: x.replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['place'] = df['place'].map(lambda x: str(x).replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['user_description'] = df['user_description'].map(lambda x: str(x).replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['user_name'] = df['user_name'].map(lambda x: str(x).replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['user_location'] = df['user_location'].map(lambda x: str(x).replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['user_lang\r'] = df['user_lang\r'].map(lambda x: x.replace("\n","").replace("\r"," ").replace("  "," ").strip())
df['user_lang'] = df['user_lang\r']
df.drop("user_lang\r", axis=1, inplace=True)
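The repeated lambda above could also be factored into a single helper. The following is only a sketch under stated assumptions: `clean_text` is a hypothetical name, the sample data is made up, and the helper collapses every run of whitespace into one space, which is slightly stronger than the single `replace("  ", " ")` pass used in the cell above.

```python
import pandas as pd

def clean_text(value):
    """Drop newlines, turn carriage returns into spaces, collapse whitespace."""
    # str() mirrors the notebook's handling of non-string values (NaN becomes "nan")
    text = str(value).replace("\n", "").replace("\r", " ")
    # split() with no arguments splits on any whitespace run and trims both ends
    return " ".join(text.split())

# Hypothetical sample data, not taken from the real dataset
sample = pd.DataFrame({
    "tweets": ["hello\r\nworld   again ", "  single line"],
    "place": ["Madrid\r", "Lisbon"],
})

# Apply the same cleaning to every string column of interest
for column in ["tweets", "place"]:
    sample[column] = sample[column].map(clean_text)

print(sample)
```

Keeping the logic in one function means a change to the cleaning rules only has to be made once.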

Turning to the "tweets" column, we detected some truncated tweets that exceeded the maximum length (140 characters), so we remove them from the DataFrame:

In [None]:
df = df[df["tweet_long"] < 139]
df.shape
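The filter relies on the `tweet_long` column, which presumably stores the character length of each tweet; how it was computed is not shown in this notebook. As a sketch under that assumption, with made-up sample tweets, the length could be derived and filtered like this:

```python
import pandas as pd

# Hypothetical sample: one short tweet and one that reaches 140 characters
sample = pd.DataFrame({"tweets": ["short tweet", "x" * 140]})

# Assumption: tweet_long is the character count of the tweet text
sample["tweet_long"] = sample["tweets"].str.len()

# Keep only tweets below the 139-character threshold used above
kept = sample[sample["tweet_long"] < 139]
print(kept.shape)
```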

In [None]:
df.head()

Let's check whether there are repeated tweets. If so, we will remove the duplicates:

In [None]:
df["tweets"].unique().shape

In [None]:
df_clean = pd.DataFrame(data=df["tweets"].unique(), columns=["tweets"])
df_clean.shape

We merge the unique values with the rest of the columns, sorted by "created_date", and keep the first match for each tweet to get the final DataFrame:

In [None]:
df_merge = pd.merge(df_clean, df.sort_values("created_date", ascending=True), how='left')
df_final = df_merge.drop_duplicates('tweets', keep='first')
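A shorter route to a similar result is to sort by date and drop duplicates directly, without the intermediate merge. This is a sketch on made-up data, not the notebook's method; note that the resulting row order follows `created_date` rather than the order in which each tweet first appears.

```python
import pandas as pd

# Hypothetical sample with one tweet text repeated on two dates
sample = pd.DataFrame({
    "tweets": ["same text", "other text", "same text"],
    "created_date": pd.to_datetime(["2018-03-02", "2018-03-01", "2018-02-28"]),
})

# Sort oldest first, then keep the first (earliest) row for each tweet text
deduped = (
    sample.sort_values("created_date", ascending=True)
          .drop_duplicates("tweets", keep="first")
          .reset_index(drop=True)
)
print(deduped)
```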

In [None]:
df_final.head()

In [None]:
df_final.shape

Finally, we obtain the processed DataFrame with a total of 5,301 records and 20 columns, and we save it to a CSV file:

In [None]:
df_final.to_csv('preprocessed_tweets.csv', index=False, header=True, sep='|', decimal=',', encoding='utf-8')
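Because the file is written with `sep='|'` and `decimal=','`, it has to be read back with the same options for numeric columns to be parsed correctly. The sketch below illustrates this round trip with an in-memory buffer and a made-up DataFrame; the `score` column is hypothetical and only serves to show the decimal handling.

```python
import io
import pandas as pd

# Hypothetical sample with a float column to show the decimal separator
sample = pd.DataFrame({
    "tweets": ["first tweet", "second tweet"],
    "score": [1.5, 2.25],
})

# Write to an in-memory buffer with the same separator and decimal settings
buffer = io.StringIO()
sample.to_csv(buffer, index=False, header=True, sep="|", decimal=",")
csv_text = buffer.getvalue()

# Read it back with matching options so "1,5" is parsed as the float 1.5
restored = pd.read_csv(io.StringIO(csv_text), sep="|", decimal=",")
print(restored.dtypes)
```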