# Data Cleansing

Data cleansing is one of the most important steps in any data science project.
The purpose of this notebook is to clean the *Tweets* column. Twitter is a social network where informal language is common, so tweets may contain misspelled words and platform-specific elements such as hashtags, user names and links. To clean the data, we will perform the following actions:

- Round 0: remove extra spaces.
- Round 1: remove mentions, hashtags, links or URLs and "RT" words.
- Round 2: set text to lowercase, remove all punctuation marks and remove words that contain numbers.
- Round 3: remove accent marks.
- Round 4: remove repeated letters in words.
- Round 5: convert laughter expressions into a representative word.

We start by importing the libraries and configuring some settings:

In [None]:
import re
import string
import unicodedata
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('max_colwidth',150)

First, we load the processed data from the input CSV file:

In [None]:
df = pd.read_csv("data/3_Data_Cleansing_8M.csv", sep="|", lineterminator='\n', low_memory=False)
df.head()

In [None]:
df.shape

There are 5,301 rows and 20 columns.

## Round 0:

The following step removes or replaces unwanted whitespace in the "Tweets" column, specifically:

- Spaces at the beginning and the end of tweets
- Extra spaces
- Carriage return (\r)
- New line (\n)


In [None]:
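# split() without arguments splits on any run of whitespace (spaces, \r, \n),
# so rejoining with a single space also trims both ends of the tweet
df['tweets'] = df['tweets'].map(lambda x: " ".join(x.split()))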
df['tweets'].sample(5)
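
Note that a chained `.replace("  ", " ")` makes a single pass over the text, so a run of three or more spaces still leaves a double space behind. As a minimal sketch (the helper name `normalize_whitespace` is an assumption, not part of the notebook), splitting on whitespace and rejoining handles runs of any length, as well as `\r` and `\n`:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on spaces, tabs, \r and \n alike
    return " ".join(text.split())


# A single replace pass leaves a double space when three spaces are present
print(repr("a   b".replace("  ", " ")))                 # 'a  b'
print(repr(normalize_whitespace("  hola \r\n  mundo   ")))  # 'hola mundo'
```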

Some sample tweets:

In [None]:
tweet_sample1 = df['tweets'].iloc[3]
tweet_sample2 = df['tweets'].iloc[3]
tweet_sample3 = df['tweets'].iloc[3]
tweet_sample4 = df['tweets'].iloc[3]
tweet_sample5 = df['tweets'].iloc[1432]

## Round 1:

The first round consists of removing some elements from the tweets. The result will be tweets without:
- user mentions 
- hashtags 
- links or URLs starting with "https://"
- "RT" words
- "vía" words

In [None]:
def cleaning_tweets(text, letters):
    """
    Function that receives a text and a list of prefixes to be removed from the text.
    The output will be the initial text without the words that start with those prefixes.
    """
    # str.startswith accepts a tuple, so one pass over the words is enough;
    # an empty list of prefixes simply returns the text unchanged
    prefixes = tuple(letters)
    clean_text = [word for word in text.split() if not word.startswith(prefixes)]
    return ' '.join(clean_text)

round1 = lambda x: cleaning_tweets(x,["@", "#", "https://", "RT", "vía"])

In [None]:
df['tweets'] = df['tweets'].map(round1)
df.shape
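
Because the filter works on prefixes, any word that merely starts with "RT" or "vía" is dropped as well. The following is a minimal sketch of a variant that treats those two as exact words and also covers "http://" links; the name `drop_tokens` and its parameters are hypothetical and not part of the notebook:

```python
def drop_tokens(text,
                prefixes=("@", "#", "https://", "http://"),
                exact=("RT", "vía")):
    """Drop words that start with a prefix or that are exactly a listed word."""
    kept = [
        word for word in text.split()
        if not word.startswith(prefixes) and word not in exact
    ]
    return " ".join(kept)


sample = "RT @user: RTVE informa #news https://t.co/x vía web"
print(drop_tokens(sample))  # RTVE informa web
```

With the prefix-only approach, "RTVE" in this sample would be removed along with "RT".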

Here is how a sample tweet looks after Round 0 and Round 1:

In [None]:
print("Original tweet -> " + tweet_sample1)
print("Cleaned tweet -> " + df["tweets"].iloc[3])

## Round 2:

The second round consists of:

- setting the text to lowercase
- removing all punctuation marks
- removing words that contain numbers

In [None]:
def remove_punctuation(text):
    """ 
    Function that receives a text.
    It returns the text in lowercase, without text in square brackets,
    without punctuation marks (except #) and without words that contain numbers.
    """
    punctuation = string.punctuation.replace("#", "") + "¿¡...“”"
    text = text.lower()
    # drop any text enclosed in square brackets
    text = re.sub(r'\[.*?\]', ' ', text)
    # drop punctuation marks (escaped so they are literal inside the character class)
    text = re.sub('[%s]' % re.escape(punctuation), '', text)
    # drop every word that contains at least one digit, then collapse whitespace
    text = re.sub(r'\w*\d\w*', '', text)
    return ' '.join(text.split())

round2 = lambda x: remove_punctuation(x)

In [None]:
df['tweets'] = df['tweets'].map(round2)
df.shape
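
To make the effect of each step visible, here is a small worked example on a made-up tweet. It is only a sketch: the helper name `strip_punct_and_digit_words` and the sample sentence are assumptions for illustration.

```python
import re
import string

# Same idea as the notebook: keep "#", add the Spanish opening marks and quotes
PUNCT = string.punctuation.replace("#", "") + "¿¡“”"


def strip_punct_and_digit_words(text):
    text = text.lower()
    # re.escape makes characters such as ] ^ - \ literal inside the class
    text = re.sub('[%s]' % re.escape(PUNCT), '', text)
    # \w*\d\w* matches a whole word as soon as it contains one digit
    text = re.sub(r'\w*\d\w*', '', text)
    return ' '.join(text.split())


print(strip_punct_and_digit_words("¡Hola! Tengo 2 entradas para el b2b, ¿vienes?"))
# hola tengo entradas para el vienes
```

Both the standalone "2" and the mixed token "b2b" disappear, since the pattern removes the entire word rather than only its digits.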

Here is how the sample tweet looks after Round 2:

In [None]:
print("Original tweet -> " + tweet_sample2)
print("Cleaned tweet -> " + df["tweets"].iloc[3])

## Round 3: 
The third round consists of removing accent marks from the tweets.

In [None]:
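def delete_accent_mark(word):
    """
    Function that receives a text.
    It returns the text without accent marks.
    """
    # NFD splits each accented character into a base letter plus combining marks;
    # category 'Mn' (nonspacing mark) identifies the marks to drop
    s = ''.join((c for c in unicodedata.normalize('NFD',word) if unicodedata.category(c) != 'Mn'))
    return s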

round3 = lambda x: delete_accent_mark(x)

In [None]:
df['tweets'] = df['tweets'].map(round3)
df.shape
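
One side effect of dropping every combining mark is that "ñ" becomes "n", because NFD decomposes it into "n" plus a combining tilde. If that distinction matters for later analysis, a variant could skip that letter. The sketch below is an assumption for illustration; `strip_accents_keep_enye` is a hypothetical helper, not part of the notebook:

```python
import unicodedata


def strip_accents(text):
    """Same idea as the notebook function: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def strip_accents_keep_enye(text):
    """Hypothetical variant that leaves ñ/Ñ untouched."""
    result = []
    # Normalize to NFC first so that ñ is a single code point we can recognise
    for ch in unicodedata.normalize('NFC', text):
        result.append(ch if ch in 'ñÑ' else strip_accents(ch))
    return ''.join(result)


print(strip_accents('camión año pingüino'))            # camion ano pinguino
print(strip_accents_keep_enye('camión año pingüino'))  # camion año pinguino
```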

Here is how the sample tweet looks after Round 3:

In [None]:
print("Original tweet -> " + tweet_sample3)
print("Cleaned tweet -> " + df["tweets"].iloc[3])

## Round 4: 

In this round, some misspelled words are corrected. We define a function that removes repeated letters, e.g.:
"holaa" >> "hola"

We make an exception for letters that can legitimately appear doubled, mostly in Spanish words, e.g.:

- "c"  -> "acceder"
- "e"  -> "leed"
- "l"  -> "llama" 
- "n"  -> "innato"
- "r"  -> "perro"
- "s"  -> "impossible"
- "p"  -> "pp"

In [None]:
def delete_repeted_letters(text):
    """
    Function that receives a text to find words with repeated letters in it.
    It returns the text with each run of a repeated character reduced to one,
    except for the letters that are allowed to repeat.
    """
    result_text = []
    spanish_comun_doble_letters = ["c","e","l","n","r","s","p"] 
    
    for i in range(0,len(text)):
        # keep the character if it is the first one, differs from the previous one,
        # or is one of the letters allowed to repeat
        if i == 0 or text[i] != text[i-1] or (text[i] in spanish_comun_doble_letters):
            result_text.append(text[i])
    result_text = ''.join(result_text)
    return result_text

round4 = lambda x : delete_repeted_letters(x)

In [None]:
df['tweets'] = df['tweets'].map(round4)
df.shape
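
The function leaves the exempted letters untouched however many times they repeat, so "perrrro" stays as it is. As a sketch of one possible refinement (the helper `collapse_repeats` is hypothetical, not part of the notebook), a regex with a backreference can cap those letters at two repetitions and reduce every other run to a single character:

```python
import re

ALLOWED_DOUBLE = "celnrsp"  # letters that may legitimately appear doubled


def collapse_repeats(text):
    # Runs of three or more allowed letters are capped at two: "rrrr" -> "rr"
    text = re.sub(r'([%s])\1{2,}' % ALLOWED_DOUBLE, r'\1\1', text)
    # Runs of any other character are reduced to one: "aaa" -> "a"
    text = re.sub(r'([^%s])\1+' % ALLOWED_DOUBLE, r'\1', text)
    return text


print(collapse_repeats("holaaa perrrro buenoooo"))  # hola perro bueno
print(collapse_repeats("llama acceder"))            # llama acceder
```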

Here is how the sample tweet looks after Round 4:

In [None]:
print("Original tweet -> " + tweet_sample4)
print("Cleaned tweet -> " + df["tweets"].iloc[3])

## Round 5:

Informal language also contains expressions that represent laughter. In this round we normalize these expressions, replacing each one with a single representative word.
The following function normalizes expressions like these:
- "jajaja" >> "LOL"
- "jejeje" >> "LOL"
- "jijijijiji" >> "LOL"
- "hahahaha" >> "LOL"

In [None]:
def replace_laught(text):
    """
    Function that receives a text to find laughter expressions.
    It returns the text with each laughter expression replaced with "LOL".
    """
    new_text = []
    laught_normalizer_word = 'LOL'
    # re.match anchors at the start of the word, so any word that begins with
    # one of these sequences is treated as laughter
    laught_pattern = re.compile(r'jaja|ajaj|jeje|ejej|haha|ahah|jiji|ijij')
    for word in text.split():
        if laught_pattern.match(word):
            word = laught_normalizer_word
        new_text.append(word)

    result = ' '.join(new_text)
    return result

round5 = lambda x : replace_laught(x)

In [None]:
df['tweets'] = df['tweets'].map(round5)
df.shape
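
Matching only the start of a word means that any word beginning with "jaja", "haha", etc. is replaced, whatever follows. A stricter option is to require that the whole word is a repeated syllable. The pattern and the helper `replace_laughter_strict` below are assumptions for illustration, not the notebook's method:

```python
import re

# The whole word must be one consonant-vowel pair (j/h + a/e/i) repeated at
# least twice, starting either on the consonant ("jaja") or the vowel ("ajaj").
LAUGH_RE = re.compile(
    r'(?:([jh])([aei])(?:\1\2)+\1?'   # jaja, jajaj, hehehe, ...
    r'|([aei])([jh])(?:\3\4)+\3?)'    # ajaj, ajaja, ihihi, ...
)


def replace_laughter_strict(text, token='LOL'):
    """Replace a word with the token only when the entire word is laughter."""
    return ' '.join(token if LAUGH_RE.fullmatch(w) else w for w in text.split())


print(replace_laughter_strict("jajaja que hija tan jaja jejej ajajaja"))
# LOL que hija tan LOL LOL LOL
```

The backreferences force the same consonant and vowel to repeat, which is why an ordinary word such as "hija" is left alone.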

Here is how a sample tweet looks after Round 5:

In [None]:
print("Original tweet -> " + tweet_sample5)
print("Cleaned tweet -> " + df["tweets"].iloc[1432])
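
Since every round is a function from string to string, the rounds can also be chained into a single callable and applied in one pass. This is a minimal sketch with stand-in steps; `apply_rounds` and `demo_rounds` are hypothetical names, and in the notebook the list would hold `round1` to `round5`:

```python
from functools import reduce


def apply_rounds(text, rounds):
    """Apply each cleaning function in order, feeding each output to the next."""
    return reduce(lambda acc, fn: fn(acc), rounds, text)


# Stand-in steps used only for this demonstration
demo_rounds = [str.strip, str.lower, lambda t: " ".join(t.split())]

print(apply_rounds("  Hola   MUNDO ", demo_rounds))  # hola mundo
```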

Once all cleansing rounds are finished, we remove the rows with empty tweets and also the repeated tweets.

In [None]:
df_no_empty_tweets = df[df["tweets"] != ""]
df_cleaned = pd.DataFrame(data=df_no_empty_tweets["tweets"].unique(), columns=["tweets"])

In [None]:
df_cleaned.head(10)

We merge the unique values with the rest of the columns to get the final dataframe with cleaned tweets:

In [None]:
df_final = pd.merge(df_cleaned, df.sort_values("created_date", ascending=True), how='left')

In [None]:
df_final.shape

Removing repeated tweets:

In [None]:
# removing repeated tweets
df_final = df_final.drop_duplicates('tweets', keep='first')
df_final['tweets'] = df_final['tweets'].map(lambda x: x.replace("\n","").replace("\r"," ").replace("  "," ").strip())
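
The filter, merge and de-duplication steps keep, for each distinct tweet text, the row with the earliest `created_date`. As a sketch on made-up data (the toy frame and its values are assumptions for illustration), the same selection can be expressed without the intermediate merge. It should keep the same rows, although their order may differ from the merged result:

```python
import pandas as pd

# Toy data: one empty tweet and one repeated tweet
toy = pd.DataFrame({
    "tweets": ["hola", "", "adios", "hola"],
    "created_date": ["2020-03-08", "2020-03-08", "2020-03-07", "2020-03-06"],
})

deduped = (
    toy[toy["tweets"] != ""]                      # drop empty tweets
    .sort_values("created_date", ascending=True)  # oldest first
    .drop_duplicates("tweets", keep="first")      # keep the oldest copy
    .reset_index(drop=True)
)

print(deduped)
```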

Next, we remove possible extra spaces in the rest of the columns:

In [None]:
def clean_spaces(x):
    return x.replace("\n","").replace("\r"," ").replace("  "," ").strip()

# columns that only contain strings
for col in ['tweets', 'source', 'language', 'user_lang\r']:
    df_final[col] = df_final[col].map(clean_spaces)

# columns cast with str() first; note that str() turns missing values (NaN) into "nan"
for col in ['place', 'user_description', 'user_name', 'user_location']:
    df_final[col] = df_final[col].map(lambda x: clean_spaces(str(x)))

# the last column name carries a trailing "\r"; rename it to the clean name
df_final = df_final.rename(columns={'user_lang\r': 'user_lang'})

In [None]:
df_final.shape

In [None]:
df_final.head()

Finally, we get the cleaned dataframe with a total of 4,866 rows and 20 columns. It is saved in a .csv file.

In [None]:
df_final.to_csv('cleaned_tweets.csv', 
                index=False, 
                header=True,
                sep='|',
                decimal=',', 
                encoding='utf-8')