# Twitter Sentiment Analysis: Data Cleaning

## Notebook Overview
This notebook is dedicated to cleaning and preparing the Twitter dataset for subsequent sentiment analysis. The primary goal is to transform raw tweet data into a clean format suitable for NLP tasks. This involves handling missing data, normalizing text, and removing noise such as special characters and URLs.

## Table of Contents
1. **Introduction**
   - Purpose of the notebook
   - Description of the dataset
2. **Data Loading**
   - Import necessary libraries
   - Load the dataset
3. **Initial Data Exploration**
   - Display basic information and statistics
   - Identify missing values and anomalies
4. **Data Cleaning**
   - Remove unnecessary columns
   - Normalize the target column
   - Handle missing data
   - Normalize text data
   - Remove special characters, URLs, mentions, and hashtags
5. **Saving the Cleaned Data**
   - Save the processed data to a new CSV file
6. **Conclusion**
   - Summary of the data cleaning steps
   - Next steps and transition to the next phase of the project

# 1. Introduction

The focus of this notebook is to clean the data extracted from Twitter so that it is ready for analysis and modeling. The dataset contains six columns, of which only the text and sentiment label are essential for our sentiment analysis task.

## Data Source

This notebook uses the "Sentiment140" dataset, which was created by researchers at Stanford University specifically for sentiment analysis tasks. It is commonly used in the machine learning community for training and testing sentiment analysis models because of its large volume and diversity of tweets. The dataset can be found here: https://www.kaggle.com/datasets/milobele/sentiment140-dataset-1600000-tweets

## Dataset Composition

The Sentiment140 dataset consists of 1.6 million tweets extracted using the Twitter API. The tweets were labeled automatically, with the sentiment inferred from the emoticons they contained. The columns in the dataset are:

- `target`: The polarity of the tweet (0 = negative, 2 = neutral, 4 = positive). In most versions of this dataset, only negative and positive sentiments are included (0 and 4).
- `id`: The unique identifier for each tweet.
- `date`: The date and time when the tweet was posted.
- `flag`: This field is often set to "NO_QUERY" and does not contain useful information for sentiment analysis.
- `user`: The username of the tweet's author.
- `text`: The text of the tweet itself.

This structure helps in various NLP tasks, particularly sentiment analysis, by providing a pre-labeled set of data for training machine learning models.

When loading this data, specify the column names and the encoding explicitly, as done in the next section. This sets a solid foundation for the data cleaning and analysis steps that follow.

# 2. Data Loading

In [17]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

import re
import unicodedata

In [18]:
# Define column names for the dataset
file_path = 'raw_data/training.1600000.processed.noemoticon.csv'
column_names = ['target', 'id', 'date', 'flag', 'user', 'text']

# Load the dataset
data = pd.read_csv(file_path, names=column_names, encoding='latin1')

#### Display the first few rows of the dataset

In [19]:
data.head()

|   | target | id | date | flag | user | text |
|---|--------|----|------|------|------|------|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


# 3. Initial Data Exploration

#### Display basic information about the dataset

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   id      1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


#### Let's parse the date column to datetime.

In [21]:
data['date'].head(10)

0    Mon Apr 06 22:19:45 PDT 2009
1    Mon Apr 06 22:19:49 PDT 2009
2    Mon Apr 06 22:19:53 PDT 2009
3    Mon Apr 06 22:19:57 PDT 2009
4    Mon Apr 06 22:19:57 PDT 2009
5    Mon Apr 06 22:20:00 PDT 2009
6    Mon Apr 06 22:20:03 PDT 2009
7    Mon Apr 06 22:20:03 PDT 2009
8    Mon Apr 06 22:20:05 PDT 2009
9    Mon Apr 06 22:20:09 PDT 2009
Name: date, dtype: object

In [22]:
# Extract the three-letter uppercase timezone abbreviation (e.g. "PDT") from each date string
data['timezone'] = data['date'].str.extract(r"\b([A-Z]{3})\b")

# Display timezone counts
timezone_counts = data['timezone'].value_counts(dropna=False)  # Include NaN counts
print("Timezone Counts:")
print(timezone_counts)

Timezone Counts:
timezone
PDT    1600000
Name: count, dtype: int64


In [23]:
# Define a mapping of timezone abbreviations to UTC offsets
timezone_map = {
    "PDT": "-0700",
    "EDT": "-0400",
    "CST": "-0600",
    "MST": "-0700",
    "EST": "-0500",
}

# Replace timezone abbreviations with offsets in the date column
# (the abbreviations are literal strings, so no regex matching is needed)
for tz, offset in timezone_map.items():
    data['date'] = data['date'].str.replace(tz, offset, regex=False)

In [24]:
# Parse the date column with timezone offsets
date_format_with_offset = "%a %b %d %H:%M:%S %z %Y"  # %z handles offsets
data['date'] = pd.to_datetime(data['date'], format=date_format_with_offset, errors='coerce')

# Verify the parsing
data[['date']].head()

|   | date |
|---|------|
| 0 | 2009-04-06 22:19:45-07:00 |
| 1 | 2009-04-06 22:19:49-07:00 |
| 2 | 2009-04-06 22:19:53-07:00 |
| 3 | 2009-04-06 22:19:57-07:00 |
| 4 | 2009-04-06 22:19:57-07:00 |


##### Looks good!
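
As an optional sanity check, the sketch below repeats the same parsing steps on two hand-written timestamps (illustrative values, not rows taken from the dataset). Because `errors='coerce'` silently turns unparseable strings into `NaT`, it counts the failures explicitly. It also converts the result to UTC, which is an extra step this notebook does not perform and is shown only as an option.

```python
import pandas as pd

# Two illustrative timestamps in the same layout as the raw `date` column
dates = pd.Series([
    "Mon Apr 06 22:19:45 PDT 2009",
    "Sat May 30 10:01:02 PDT 2009",
])

# Literal replacement of the abbreviation with its UTC offset
dates = dates.str.replace("PDT", "-0700", regex=False)

# Parse; anything that does not fit the format becomes NaT
parsed = pd.to_datetime(dates, format="%a %b %d %H:%M:%S %z %Y", errors="coerce")

# Count rows that failed to parse instead of trusting head() alone
n_failed = int(parsed.isna().sum())
print(f"Rows that failed to parse: {n_failed}")

# Optional: express the timestamps in UTC
parsed_utc = parsed.dt.tz_convert("UTC")
print(parsed_utc)
```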

#### Describe the dataset

In [25]:
data.describe()

|       | target | id |
|-------|--------|----|
| count | 1600000.0 | 1600000.0 |
| mean  | 2.0 | 1998818000.0 |
| std   | 2.000001 | 193576100.0 |
| min   | 0.0 | 1467810000.0 |
| 25%   | 0.0 | 1956916000.0 |
| 50%   | 2.0 | 2002102000.0 |
| 75%   | 4.0 | 2177059000.0 |
| max   | 4.0 | 2329206000.0 |


#### Data types

In [26]:
data.dtypes

target                          int64
id                              int64
date        datetime64[ns, UTC-07:00]
flag                           object
user                           object
text                           object
timezone                       object
dtype: object

#### Let's identify missing values and anomalies

In [27]:
# Check for missing values
missing_values = data.isnull().sum()

print("Missing Values Count:")
print(missing_values)


Missing Values Count:
target      0
id          0
date        0
flag        0
user        0
text        0
timezone    0
dtype: int64


In [28]:
# Unique values in the target column
print("Unique values in 'target':")
print(data['target'].unique())

# Check for unique IDs
is_id_unique = data['id'].is_unique
print(f"Are IDs unique? {is_id_unique}")


Unique values in 'target':
[0 4]
Are IDs unique? False
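
The check above reports that tweet IDs are not unique, and the notebook does not look into why. Below is a minimal sketch of how repeated IDs could be inspected before the `id` column is dropped. It uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) and also checks whether any repeated ID carries more than one label, which would be worth knowing before training.

```python
import pandas as pd

# Hypothetical miniature of the dataset: id 2 appears twice with different labels
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "target": [0, 0, 4, 4],
    "text": ["a", "b", "b", "c"],
})

# All rows whose id occurs more than once (keep=False marks every copy)
dup_ids = toy[toy["id"].duplicated(keep=False)].sort_values("id")
print(dup_ids)

# Number of distinct labels per repeated id; more than one means a conflict
labels_per_id = dup_ids.groupby("id")["target"].nunique()
conflicting_ids = labels_per_id[labels_per_id > 1].index.tolist()
print(f"IDs with conflicting labels: {conflicting_ids}")
```

On the real data, the same two steps would be applied to `data` in place of `toy`.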


In [29]:
# Unique values in the 'text' column
unique_texts_count = data['text'].nunique()  # Count of unique texts
total_rows = len(data)  # Total number of rows in the dataset

# Calculate duplicates
duplicates_count = total_rows - unique_texts_count

# Print results
print(f"There are {unique_texts_count} unique texts.")
print(f"This means we need to drop {duplicates_count} duplicate rows.")

There are 1581466 unique texts.
This means we need to drop 18534 duplicate rows.


Let's drop them.

In [15]:
# Drop duplicates based on the 'text' column
data.drop_duplicates(subset=["text"], inplace=True)

# Reset the index for the updated DataFrame
data.reset_index(drop=True, inplace=True)

# Confirm the new size of the dataset
print(f"Dataset size after dropping duplicates: {len(data)}")

Dataset size after dropping duplicates: 1581466


In [16]:
# Check for empty or very short text
short_texts = data[data['text'].str.len() < 5]
print(f"Number of very short text entries: {len(short_texts)}")


Number of very short text entries: 0


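No entries are shorter than five characters. The filter below is applied anyway as a safeguard in case the input data changes.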

In [16]:
# Identify and count very short text entries
short_texts = data[data['text'].str.len() < 5]
print(f"Number of very short text entries: {len(short_texts)}")

# Drop short text entries
data = data[data['text'].str.len() >= 5].reset_index(drop=True)

# Confirm the new size of the dataset
print(f"Dataset size after dropping very short text entries: {len(data)}")


Number of very short text entries: 0
Dataset size after dropping very short text entries: 1581466


# 4. Data Cleaning

#### Let's drop irrelevant features

In [17]:
# Drop irrelevant columns
data = data.drop(columns=['id', 'flag', 'user', 'timezone'])

# Verify the updated DataFrame structure
data.head()

|   | target | date | text |
|---|--------|------|------|
| 0 | 0 | 2009-04-06 22:19:45-07:00 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 2009-04-06 22:19:49-07:00 | is upset that he can't update his Facebook by ... |
| 2 | 0 | 2009-04-06 22:19:53-07:00 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 2009-04-06 22:19:57-07:00 | my whole body feels itchy and like its on fire |
| 4 | 0 | 2009-04-06 22:19:57-07:00 | @nationwideclass no, it's not behaving at all.... |


In [18]:
data.columns

Index(['target', 'date', 'text'], dtype='object')

#### Normalize the target column (0 will be negative and 1 will be positive)

In [19]:
# Normalize the target column
data['target'] = data['target'].replace({4: 1})

# Verify normalization
print("Unique values in 'target' after normalization:")
print(data['target'].unique())

Unique values in 'target' after normalization:
[0 1]
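
The dataset description lists a neutral label (2) that does not occur in this file. As a hedged alternative to `replace`, the sketch below uses `map` with an explicit dictionary, so that any label outside the expected set becomes `NaN` and can be reported instead of passing through unchanged. The `labels` series is made up for illustration.

```python
import pandas as pd

# Illustrative labels, including one unexpected value (2)
labels = pd.Series([0, 4, 4, 2])

# Explicit mapping: anything not listed becomes NaN
mapped = labels.map({0: 0, 4: 1})

# Original values that were not covered by the mapping
unexpected = labels[mapped.isna()]
print(f"Unexpected labels: {unexpected.tolist()}")
```

If no unexpected labels are found, the mapped column can be cast back to integers with `astype(int)`.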


#### Normalize text data

In [20]:
# Define a function to normalize text
def normalize_text(text):
    text = text.lower()  # Convert to lowercase
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')  # Decompose accented characters and drop anything non-ASCII
    text = re.sub(r"http\S+|www\S+|https\S+|ftp\S+", "", text, flags=re.MULTILINE)  # Remove URLs
    text = re.sub(r"@\w+", "", text)  # Remove mentions
    text = re.sub(r"#\w+", "", text)  # Remove hashtags
    text = re.sub(r"[^\w\s]", "", text)  # Remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra whitespace
    return text

# Apply normalization to the text column
data['text'] = data['text'].apply(normalize_text)

# Verify the results
print("Sample normalized text:")
print(data['text'].head())

Sample normalized text:
0    a thats a bummer you shoulda got david carr of...
1    is upset that he cant update his facebook by t...
2    i dived many times for the ball managed to sav...
3       my whole body feels itchy and like its on fire
4    no its not behaving at all im mad why am i her...
Name: text, dtype: object


#### Now let's validate the normalization

In [21]:
def validate_cleaning(data):
    """
    Validates if all mentions, URLs, hashtags, and special characters are removed.
    
    Parameters:
        data (DataFrame): The DataFrame containing the `text` column.
        
    Returns:
        dict: Counts of remaining unwanted elements.
    """
    patterns = {
        "mentions": r"@\w+",
        "urls": r"http\S+|www\S+|https\S+",
        "hashtags": r"#\w+",
        "special_characters": r"[^\w\s]",
    }
    
    results = {}
    for key, pattern in patterns.items():
        matches = data['text'].str.contains(pattern, regex=True).sum()
        results[key] = matches
    
    return results

# Run the validation
validation_results = validate_cleaning(data)

# Display the results
print("Validation Results:")
print(validation_results)

Validation Results:
{'mentions': 0, 'urls': 17, 'hashtags': 0, 'special_characters': 0}
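
The validation reports 17 rows that still match the URL pattern. The sketch below shows one way to pull such rows out for manual review. It runs on a made-up series (`cleaned`), and the two matching examples illustrate a possible explanation, offered here as an assumption rather than a finding: the pattern matches any token that merely starts with `http` or `www`, whether or not it is a real link.

```python
import pandas as pd

# Same URL pattern as in validate_cleaning
URL_PATTERN = r"http\S+|www\S+|https\S+"

# Illustrative cleaned texts (not taken from the dataset)
cleaned = pd.Series([
    "check out wwwexamplecom",  # punctuation removed from a spaced-out address
    "httpd crashed again",      # ordinary token that starts with "http"
    "nothing to see",
])

# Boolean mask of rows that still match, then the rows themselves
mask = cleaned.str.contains(URL_PATTERN, regex=True)
leftovers = cleaned[mask]
print(leftovers)
```

Applied to the notebook's data, `data.loc[data['text'].str.contains(URL_PATTERN, regex=True), 'text']` would list the 17 rows so they can be reviewed individually.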


#### Display some random samples of the text column to double check

In [22]:
random_samples = data['text'].sample(200, random_state=42)

print("Random Samples of Text:")
for i, text in enumerate(random_samples, start=1):
    print(f"{i}: {text}")

Random Samples of Text:
1: hans im an open book you can ask me anything you want hit up the blog and ask away i can give you more in depth answer there
2: wuaaah we have strawberries at home
3: very cool and to quote tbs quotnow thats funnyquot
4: the last sunday
5: oh thanks thats good to know
6: hi kate how are you
7: nice i like that combo
8: youre always welcome just give me a headsup so i can put down the machete and rinse off the bug spray
9: excited for tomorrows swim date
10: i want to call you but i live in braaaaazil xoxo
11: need a wireless card or usb hook up for my computer theres no cable jack in my room amp nor can they put one in
12: is it possible to have enough beer haha jk they started running out when i went but i was dd so i barely sipped them
13: yeah but now you wont be sitting with us and austins having trouble getting tickets
14: hoping megan is having a good birthday and wondering where everyone is
15: gave up with the studyin thing watchhinn tv
16: great loss

### Observations

#### Weirdly Encoded Characters
- **Example**: `fckbri12ndby 40` (Sample 49)  
  - This still looks unusual and might indicate an encoding issue or nonsensical text entry.
  - Likely originated from corrupted data or abbreviation/slang.

#### Short/Unclear Contexts
- **Examples**:  
  - `high low` (Sample 165)  
  - `diversity won damnn i wanted flawless to win` (Sample 160)  
  - `downtown disney` (Sample 200)  
  - These entries are brief and may lack meaningful context for sentiment analysis.

#### Edge Cases
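- **Example**: `morning twitterlandoff to work i go` (Sample 176)  
  - The words `twitterland` and `off` are fused together, probably because the punctuation that separated them was deleted without leaving a space. Casual coinages such as `twitterland` are also common in tweets and may affect NLP preprocessing.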
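
Two further artifacts are visible in the samples above: the tokens `quot` (Sample 3) and `amp` (Sample 11). These look like the remains of the HTML entities `&quot;` and `&amp;` after their punctuation was stripped, although the raw rows were not checked here. A minimal sketch of one possible fix follows, assuming the raw text does contain HTML entities: decode them first with `html.unescape`, then replace punctuation with a space rather than an empty string. The function name `clean_punctuation` and the raw example strings are hypothetical.

```python
import html
import re

def clean_punctuation(text: str) -> str:
    """Decode HTML entities, then replace punctuation with spaces."""
    text = html.unescape(text)            # &quot; -> ", &amp; -> &
    text = re.sub(r"[^\w\s]", " ", text)  # a space keeps neighbouring words apart
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical raw inputs modelled on Samples 3 and 176
print(clean_punctuation("very cool and to quote tbs &quot;now thats funny&quot;"))
print(clean_punctuation("morning twitterland...off to work i go"))
```

One trade-off of this design: contractions are split (`can't` becomes `can t`), whereas the current approach fuses them (`cant`). Which behavior is preferable depends on the tokenizer used later.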


# 5. Saving the Cleaned Data

In [23]:
data.to_csv('./clean_data/cleaned_twitter_data.csv', index=False)
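
Two practical points about saving are worth sketching, both as suggestions rather than steps this notebook performs. First, `to_csv` does not create missing directories, so the output folder has to exist. Second, a tweet that consisted only of mentions, URLs, or hashtags is an empty string after normalization, and pandas reads empty CSV fields back as `NaN` by default. The example below uses a made-up frame (`toy`) and a temporary directory so it can run anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Illustrative cleaned data: the second text became empty after normalization
toy = pd.DataFrame({"target": [0, 1, 1], "text": ["so sad", "", "great day"]})

# Create the output directory if it does not exist yet
out_dir = Path(tempfile.mkdtemp()) / "clean_data"
out_dir.mkdir(parents=True, exist_ok=True)

# Saving as-is: the empty string comes back as NaN on reload
raw_path = out_dir / "unfiltered.csv"
toy.to_csv(raw_path, index=False)
n_nan_unfiltered = int(pd.read_csv(raw_path)["text"].isna().sum())
print(f"NaN texts after reload (unfiltered): {n_nan_unfiltered}")

# Dropping empty texts before saving avoids the problem
non_empty = toy[toy["text"].str.len() > 0].reset_index(drop=True)
out_path = out_dir / "cleaned_twitter_data.csv"
non_empty.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(f"Rows saved: {len(reloaded)}")
```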

# 6. Conclusion


In this notebook, we successfully cleaned the raw text data from the Sentiment140 dataset. Below is a summary of the tasks completed:

1. **Loaded the Dataset**:
   - Imported the Sentiment140 dataset and examined its structure and data types.

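2. **Handled Missing Values and Anomalies**:
   - Verified that no missing values were present.
   - Dropped 18,534 rows whose `text` duplicated an earlier row.
   - Confirmed that no tweets were shorter than five characters.
   - Stripped non-ASCII characters and noted nonsensical entries while reviewing random samples.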

3. **Formatted the `date` Column**:
   - Standardized the `date` column by replacing timezone abbreviations with UTC offsets.
   - Converted the column to the correct `datetime` datatype for consistency.

4. **Normalized Text Data**:
   - Removed mentions, hashtags, URLs, and special characters.
   - Converted text to lowercase for uniformity.
   - Validated the cleaning process: no mentions, hashtags, or special characters remain, but 17 rows still match the URL pattern and need review.



With this, the data is now clean and ready for further analysis.

---

## Next Steps: Exploratory Data Analysis (EDA)

The next notebook will focus on exploring the cleaned dataset. This will include:
- Visualizing the distribution of sentiment classes.
- Analyzing tweet characteristics (e.g., text length, word usage).
- Identifying patterns and relationships that may influence sentiment.

This EDA will provide valuable insights to guide the preprocessing and modeling phases.
