# 02-Data-Preparation.ipynb

Data preparation refers to the process of cleaning and transforming raw data into a format suitable for analysis.\
This involves several steps, including data cleaning, data integration, data transformation, and data reduction.

### 1. Import necessary libraries

The first cell imports the libraries used below: `re`, `string`, `nltk` (together with its `stopwords` corpus and `SnowballStemmer` class), and `pandas`.\
The `nltk.download()` function then fetches the `stopwords` and `punkt` resources that the stopword filter and the tokenizer rely on.

In [7]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import pandas as pd
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 2. Define function to preprocess and clean text

The next step defines a function called `preprocess_text()`.\
This function takes a text input and cleans it in several stages. It removes URLs, mentions, and hashtags; keeps only alphanumeric tokens, discarding punctuation and purely numeric tokens; converts the text to lowercase; tokenizes it into words; removes English and French stopwords; removes punctuation; stems the words; and finally joins the tokens back into text.

In [8]:
# Build the stopword set and the stemmer once, not on every call
STOPWORDS = set(stopwords.words('english')) | set(stopwords.words('french'))
STEMMER = SnowballStemmer('english')

def preprocess_text(text):
    # Decode bytes so the str regex patterns below can be applied to them
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')
    # Return an empty string for any other non-string input (e.g. NaN)
    if not isinstance(text, str):
        return ''
    # Remove URLs, mentions, and hashtags from the text using regular expressions
    text = re.sub(r'https?\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    # Keep only alphanumeric tokens; punctuation and purely numeric tokens are dropped
    words = re.findall(r'\b(?!\d+\b)[a-zA-Z0-9]+\b', text)
    # Rejoin the tokens and convert the text to lowercase
    text = ' '.join(words).lower().strip()
    # Tokenize the text into words using NLTK's word_tokenize() function
    tokens = nltk.word_tokenize(text)
    # Remove English and French stopwords
    tokens = [token for token in tokens if token not in STOPWORDS]
    # Remove punctuation using the string.punctuation constant
    tokens = [token for token in tokens if token not in string.punctuation]
    # Stem words using NLTK's SnowballStemmer('english')
    tokens = [STEMMER.stem(token) for token in tokens]
    # Join the stemmed tokens back into text
    return ' '.join(tokens)


The `preprocess_text()` function takes a single argument `text`, which is the input text that needs to be preprocessed.\
The function first checks whether the input is a string or bytes object using the `isinstance()` function. If it is not,
the function returns an empty string.
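
One detail of the type check is worth illustrating: Python's `re` module raises a `TypeError` when a string pattern is applied to a `bytes` object, so bytes input needs decoding before the regular expressions run. The sketch below shows one way to normalize the input first. The helper name `to_text` and the choice of UTF-8 decoding are assumptions made for illustration, not part of the notebook.

```python
import re

def to_text(value):
    """Hypothetical helper: return a str for str/bytes input, '' for anything else."""
    if isinstance(value, bytes):
        # Assumption: the bytes are UTF-8 encoded; undecodable bytes are skipped
        return value.decode('utf-8', errors='ignore')
    if isinstance(value, str):
        return value
    return ''

# Applying a str pattern directly to bytes fails with a TypeError
try:
    re.sub(r'@\S+', '', b'hi @bob')
    bytes_ok = True
except TypeError:
    bytes_ok = False

# After normalization the same pattern works, and non-text input becomes ''
cleaned = re.sub(r'@\S+', '', to_text(b'hi @bob'))
missing = to_text(float('nan'))
print(bytes_ok, repr(cleaned), repr(missing))
```

Non-text values such as the `NaN` that pandas uses for missing cells fall through to the empty string, which matches the behavior described above.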

Next, the function removes URLs, mentions, and hashtags from the text using regular expressions. This is done using the `re.sub()` function,
which replaces any string that matches the regular expression with an empty string.
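
As a small illustration of the three substitutions, the sketch below applies the same patterns to a made-up tweet. The function name `strip_social_markup` and the sample text are assumptions for the example only.

```python
import re

def strip_social_markup(text: str) -> str:
    """Apply the same three re.sub() patterns used in preprocess_text()."""
    text = re.sub(r'https?\S+', '', text)  # 'http' or 'https' plus the rest of the token
    text = re.sub(r'@\S+', '', text)       # '@' plus the rest of the token
    text = re.sub(r'#\S+', '', text)       # '#' plus the rest of the token
    return text

sample = "Check https://example.com via @alice #news today"
result = strip_social_markup(sample)
print(result.split())
```

The removed tokens leave their surrounding spaces behind, so the result contains runs of whitespace. In the notebook this is harmless because the later `re.findall()` and join steps rebuild the text from individual tokens.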

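After that, the function keeps only alphanumeric tokens. This is done using the `re.findall()` function, which returns a list of all the strings that match the regular expression: here, every run of ASCII letters and digits that is not made up of digits alone. Punctuation, symbols, and standalone numbers are therefore discarded, while mixed tokens such as `2day` are kept.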
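
To make the behavior of this pattern concrete, the sketch below runs it on two made-up sentences. The sample strings are assumptions. The second one suggests a caveat for the French part of the data: because the character class only covers ASCII letters, a word containing an accented letter is not matched at all.

```python
import re

# Same pattern as in preprocess_text()
TOKEN_PATTERN = r'\b(?!\d+\b)[a-zA-Z0-9]+\b'

english = re.findall(TOKEN_PATTERN, "I bought 3 apples, 2day only: 50% off!!")
french = re.findall(TOKEN_PATTERN, "café ouvert")

print(english)  # purely numeric tokens '3' and '50' are skipped, '2day' is kept
print(french)   # 'café' is dropped because 'é' is outside [a-zA-Z0-9]
```

If accented words need to survive cleaning, the character class would have to be widened; that would be a change to the notebook's method and is only noted here.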

The function then converts the text to lowercase and tokenizes it into words using NLTK's `word_tokenize()` function. It removes English and French stopwords using NLTK's `stopwords.words('english')` and `stopwords.words('french')` lists, removes punctuation using the `string.punctuation` constant, and stems the words using NLTK's `SnowballStemmer('english')`.
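
The filtering step can be sketched without downloading any NLTK data. The two tiny stopword sets below are hypothetical stand-ins for `stopwords.words('english')` and `stopwords.words('french')`. Combining them into one `set` up front is a suggested variation: it avoids rebuilding the lists for every token and makes each membership test fast.

```python
import string

# Hypothetical miniature stopword lists standing in for the NLTK corpora
ENGLISH_STOPWORDS = {"the", "is", "a", "of"}
FRENCH_STOPWORDS = {"le", "la", "est", "de"}
STOPWORDS = ENGLISH_STOPWORDS | FRENCH_STOPWORDS

def filter_tokens(tokens):
    """Drop stopwords from either language, then single punctuation tokens."""
    return [
        token for token in tokens
        if token not in STOPWORDS and token not in string.punctuation
    ]

tokens = ["the", "cat", "est", "sur", "la", "table", "!"]
filtered = filter_tokens(tokens)
print(filtered)
```

With the real corpora, the equivalent set would be `set(stopwords.words('english')) | set(stopwords.words('french'))`, built once outside the function.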

Finally, the function joins the stemmed tokens back into text and returns it.

Overall, the `preprocess_text()` function performs several common text preprocessing steps, such as removing URLs, mentions, and hashtags, removing non-alphabetic characters and numbers, tokenizing the text into words, removing stopwords and punctuation, and stemming the words.

### 3. Define function to preprocess and clean all collected tweets

The next step defines a function called `preprocess_tweets()`.\
This function takes in a list of tweets and preprocesses each tweet using the `preprocess_text()` function defined earlier.

In [9]:
def preprocess_tweets(tweets):
    for tweet in tweets:
        tweet['content'] = preprocess_text(tweet['content'])
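
Note that `preprocess_tweets()` modifies the dictionaries in place and returns nothing. The sketch below shows that calling convention with a hypothetical stand-in, `clean()`, replacing `preprocess_text()` so the example runs without NLTK; the sample tweets are made up.

```python
def clean(text):
    # Hypothetical stand-in for preprocess_text(): lowercase and trim only
    return text.lower().strip()

def preprocess_tweets(tweets):
    # Overwrite each tweet's 'content' with its cleaned version, in place
    for tweet in tweets:
        tweet['content'] = clean(tweet['content'])

tweets = [{"content": "  Hello World "}, {"content": "SECOND Tweet"}]
returned = preprocess_tweets(tweets)

print(returned)                        # the function has no return value
print([t['content'] for t in tweets])  # the original list now holds the cleaned text
```

Because the input list is mutated, the raw text is no longer available afterwards; a caller that needs both versions would have to copy the list before calling the function.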


### 4. Read raw data

The next step reads the raw data from a CSV file using the `read_csv()` function of the pandas library. The raw data is stored in a dataframe called `df`.

In [10]:
df = pd.read_csv('../data/raw_data.csv')

### 5. Clean the dataframe

The next step applies the `preprocess_text()` function to each row of the `'content'` column of the dataframe using the `apply()` function.\
This cleans and preprocesses the text data in the dataframe.

In [None]:
df['content'] = df['content'].apply(preprocess_text)

### 6. Drop duplicates

The next step drops duplicate rows from the dataframe based on the 'content' column using the `drop_duplicates()` function.

In [None]:
df.drop_duplicates(subset=['content'],inplace=True)

### 7. Drop empty content

The next step drops rows from the dataframe that have empty 'content' using the `drop()` function.

In [None]:
df.drop(index=df[df['content'] == ''].index, inplace=True)
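
The two dropping steps can be tried on a small made-up dataframe. The sketch below uses the non-mutating forms (reassignment and a boolean mask) instead of `inplace=True`; this is an equivalent alternative shown for illustration, and the sample rows are assumptions.

```python
import pandas as pd

# Made-up rows: one duplicate and one row that became empty after cleaning
df = pd.DataFrame({
    "content": ["good day", "good day", "", "bonjour"],
    "language": ["en", "en", "en", "fr"],
})

# Keep the first occurrence of each distinct 'content' value
deduped = df.drop_duplicates(subset=["content"])

# Keep only rows whose 'content' is not the empty string
cleaned = deduped[deduped["content"] != ""]

print(cleaned)
```

The surviving rows keep their original index labels (0 and 3 here), so the index is no longer contiguous after these steps.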

### 8. Count the number of sentences after cleaning

The next step counts the number of sentences that remain after cleaning, for each language. This gives an idea of how the data is distributed across languages.

In [None]:
df['language'].value_counts()

The `value_counts()` method returns a series containing counts of unique values in the 'language' column of the cleaned data frame. This will give us a count of the number of sentences in each language after cleaning.
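
A short sketch of what `value_counts()` returns, using a made-up `language` column. The `normalize=True` variant, which reports proportions instead of counts, is an optional extra that the notebook does not use.

```python
import pandas as pd

# Made-up language labels for five cleaned sentences
df = pd.DataFrame({"language": ["en", "fr", "en", "en", "fr"]})

counts = df["language"].value_counts()                # rows per language
shares = df["language"].value_counts(normalize=True)  # fraction of rows per language

print(counts)
print(shares)
```

The result is a Series indexed by the distinct values and sorted from the most to the least frequent.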

### 9. Save clean data

In [None]:
df.to_csv('../data/clean_data.csv', encoding='utf-8')

The `to_csv()` method saves the cleaned data frame to a new CSV file.\
The `encoding='utf-8'` parameter specifies that the file should be written in UTF-8, which can represent every Unicode character that may appear in the text data.
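
One thing to be aware of when saving: by default `to_csv()` also writes the dataframe index as an unnamed first column. The sketch below writes a made-up dataframe to in-memory buffers, with and without `index=False`, to show the difference. Whether the index should be kept is a design choice the notebook leaves open.

```python
import io
import pandas as pd

# Made-up cleaned data with a non-contiguous index, as left by the dropping steps
df = pd.DataFrame(
    {"content": ["good day", "bonjour"], "language": ["en", "fr"]},
    index=[0, 3],
)

with_index = io.StringIO()
df.to_csv(with_index)                    # default: index written as first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)    # only the data columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

A file written with the default settings gains an extra unnamed column when it is read back with a plain `pd.read_csv()` call.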

In [64]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import pandas as pd
nltk.download('stopwords')
nltk.download('punkt')

# Define function to preprocess and clean text
def preprocess_text(text):
    if not isinstance(text, (str, bytes)):
        return ''
    # Remove URLs, mentions, and hashtags
    text = re.sub(r'https?\S+', '', text)
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'#\S+', '', text)
    # Remove non-alphabetic characters and numbers
    text = re.findall(r'\b(?!\d+\b)[a-zA-Z0-9]+\b', text)
    # Convert text to lowercase
    text = ' '.join(text).lower().strip()
    # Tokenize text into words
    tokens = nltk.word_tokenize(text)
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    # Stem words
    stemmer = SnowballStemmer('english')
    tokens = [stemmer.stem(token) for token in tokens]
    # Join tokens back into text
    text = ' '.join(tokens)
    return text

# Define function to preprocess and clean all collected tweets
def preprocess_tweets(tweets):
    for tweet in tweets:
        tweet['content'] = preprocess_text(tweet['content'])
        
#read raw data
df =  pd.read_csv('../data/raw_data.csv')
# clean the dataframe
df['content'] = df['content'].apply(lambda x: preprocess_text(x))

In [76]:
# drop duplicates
df.drop_duplicates(subset=['content'],inplace=True)
# drop empty content
df.drop(index=df[df['content'] == ''].index, inplace=True)
# counting the number of sentences after cleaning
df['language'].value_counts()
# save clean data
df.to_csv('../data/clean_data.csv',encoding='utf-8')