**1. Data Preprocessing**

**Load the Data**

Read the dataset into a Pandas DataFrame.

Dataset reference - Big Tech Companies - Tweet Sentiment
https://www.kaggle.com/datasets/wjia26/big-tech-companies-tweet-sentiment

In [None]:
import pandas as pd

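# Code to load the dataset
def load_dataset(file_path):
    """Read a CSV file into a DataFrame, failing loudly if it cannot be read."""
    try:
        return pd.read_csv(file_path)
    except Exception as e:
        # Re-raise with context instead of returning the error string;
        # a returned string would only fail later at tweets_df.head().
        raise RuntimeError(f"Could not load dataset from {file_path!r}") from e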

file_path = 'tech_tweets.csv'

# Load the dataset
tweets_df = load_dataset(file_path)
tweets_df.head()  # Display the first few rows of the dataframe

Unnamed: 0,created_at,file_name,followers,friends,group_name,location,retweet_count,screenname,search_query,text,twitter_id,username,polarity,partition_0,partition_1
0,10/5/2020 8:44,Nvidia,41,410,Nvidia,sydney,0,jyolyu,#Nvidia,#NVIDIA #GauGAN is actually a good tool to pra...,1.31304e+18,N0%Ice,0.4404,Technology,Nvidia
1,10/5/2020 8:44,Nvidia,367,267,Nvidia,PARIS,0,MiClaverie,#Nvidia,"#BullSequana X2415, the first #supercomputer b...",1.31304e+18,Michèle Claverie,0.0,Technology,Nvidia
2,10/5/2020 8:41,Nvidia,14,104,Nvidia,Japan,0,_stingraze,#Nvidia,I'm going to attend GTC 2020 tonight! Excited....,1.31304e+18,Tsubasa Kato,0.4003,Technology,Nvidia
3,10/5/2020 8:28,Nvidia,18286,941,Nvidia,,0,gamingonlinux,#Nvidia,#NVIDIA delays launch of #GeForce RTX 3070 unt...,1.31303e+18,GamingOnLinux 🐧,0.0,Technology,Nvidia
4,10/5/2020 8:18,Nvidia,42,84,Nvidia,"Paris, France",0,anupdshetty,#Nvidia,"#BullSequana X2415, the first #supercomputer b...",1.31303e+18,Anup Shetty,0.0,Technology,Nvidia


The dataset contains the following columns:

*  created_at: The date and time when the tweet was created.
*  file_name: A category or file name associated with the tweet; here it appears to be the company name.
*  followers: The number of followers of the tweet author.
*  friends: The number of friends (or following count) of the tweet author.
*  group_name: A second category label; in the sample rows it matches file_name.
*  location: The location of the tweet author.
*  retweet_count: The number of times the tweet was retweeted.
*  screenname: The screen name of the tweet author.
*  search_query: The query used to find the tweet, likely a hashtag.
*  text: The text of the tweet.
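*  twitter_id: The tweet identifier. It is loaded as a float (shown as 1.31304e+18), so precision is limited and the values are not unique per tweet (401 distinct values in 14,323 rows).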
*  username: The username of the tweet author.
*  polarity: A sentiment polarity score.
*  partition_0 and partition_1: Additional categorical groupings; in the sample rows, partition_0 is Technology and partition_1 repeats the company name.





In [None]:
# Display the shape of the data frame
tweets_df_shape = tweets_df.shape

In [None]:
# List the columns of the data frame
tweets_df_columns = tweets_df.columns

In [None]:
# Sort and display the top 10 tweets by the number of retweets
tweets_df_sorted_by_retweets = tweets_df.sort_values(by='retweet_count', ascending=False)[['text', 'created_at', 'username', 'location', 'retweet_count']].head(10)

In [None]:
# Sort by date, then by number of retweets (descending), and keep the first 10 rows.
# Note: created_at is still a string here, so the date ordering is lexicographic.
tweets_df_sorted_by_date_and_retweets = tweets_df.sort_values(by=['created_at', 'retweet_count'], ascending=[True, False])[['text', 'created_at', 'username', 'location', 'retweet_count']].head(10)

tweets_df_shape, tweets_df_columns, tweets_df_sorted_by_retweets, tweets_df_sorted_by_date_and_retweets


((14323, 16),
 Index(['created_at', 'file_name', 'followers', 'friends', 'group_name',
        'location', 'retweet_count', 'screenname', 'search_query', 'text',
        'twitter_id', 'username', 'polarity', 'partition_0', 'partition_1',
        'cleaned_text'],
       dtype='object'),
                                                     text        created_at  \
 6694   [ STREAMING PARTY BACK DOOR ]  JOIN OUR MASS S...    9/28/2020 9:08   
 9137   #TheGiftedGraduationEP6 Catch-up  On #YouTube ...  10/11/2020 14:34   
 5973   [ STREAM BACK DOOR ]  JOIN OUR MASS STREAMING ...   9/22/2020 15:26   
 10413  Question No.4 #DesiDimeCricketFever  Comment y...  10/10/2020 12:01   
 8394   SOUND ON 🔊 ¡Corre a mi canal de #YouTube p...    9/29/2020 0:43   
 13718  Who is ready for tomorrow? #AmbreV2   Cydia re...   9/29/2020 17:16   
 13962  Epic 😂 #Coding #CodeNewbies #100DaysOfCode ...   10/3/2020 10:53   
 13430  Google ChromeCast (Google TV, Sabrina) appears...    9/26/2020 4:34   
 1
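
At this point created_at holds strings such as 10/5/2020 8:44, so sorting on it compares characters, not dates: every October timestamp ("10/...") sorts before every September one ("9/..."). A minimal sketch of a chronological sort, assuming all timestamps follow the month/day/year hour:minute pattern seen in the sample rows. The small frame, its retweet counts, and the created_at_parsed column are illustrative and not part of the notebook:

```python
import pandas as pd

# Illustrative rows; timestamps follow the pattern in the outputs above,
# retweet counts are made up
sample = pd.DataFrame({
    "created_at": ["9/28/2020 9:08", "10/11/2020 14:34", "10/5/2020 8:44"],
    "retweet_count": [3, 2, 0],
})

# String sort: "10/..." comes before "9/..." because '1' < '9'
string_order = sample.sort_values("created_at")["created_at"].tolist()

# Parse once with an explicit format (assumed: month/day/year hour:minute)
sample["created_at_parsed"] = pd.to_datetime(
    sample["created_at"], format="%m/%d/%Y %H:%M"
)
chrono_order = sample.sort_values("created_at_parsed")["created_at"].tolist()

print(string_order)
print(chrono_order)
```

Parsing once and sorting on the parsed column keeps the original created_at text available for display.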

**Understanding & Preprocessing Data**

**Data Info**

In [None]:
# Display information about the dataset.
# Note: info() prints its summary and returns None, so tweets_df_info is None.
tweets_df_info = tweets_df.info()

# Display descriptive statistics of the dataset
tweets_df_describe = tweets_df.describe()

tweets_df_info, tweets_df_describe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14323 entries, 0 to 14322
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   created_at     14323 non-null  object 
 1   file_name      14323 non-null  object 
 2   followers      14323 non-null  int64  
 3   friends        14323 non-null  int64  
 4   group_name     14323 non-null  object 
 5   location       10941 non-null  object 
 6   retweet_count  14323 non-null  int64  
 7   screenname     14323 non-null  object 
 8   search_query   14323 non-null  object 
 9   text           14323 non-null  object 
 10  twitter_id     14323 non-null  float64
 11  username       14323 non-null  object 
 12  polarity       14323 non-null  float64
 13  partition_0    14323 non-null  object 
 14  partition_1    14323 non-null  object 
 15  cleaned_text   14323 non-null  object 
dtypes: float64(2), int64(3), object(11)
memory usage: 1.7+ MB


(None,
           followers       friends  retweet_count    twitter_id      polarity
 count  1.432300e+04  1.432300e+04   14323.000000  1.432300e+04  14323.000000
 mean   1.137609e+04  3.421877e+03       0.732109  1.311577e+18      0.177893
 std    1.212538e+05  1.569047e+04       4.487186  2.655794e+15      0.376158
 min    0.000000e+00  0.000000e+00       0.000000  1.307470e+18     -0.971100
 25%    7.100000e+01  9.750000e+01       0.000000  1.308850e+18      0.000000
 50%    4.870000e+02  4.720000e+02       0.000000  1.310910e+18      0.000000
 75%    3.515000e+03  2.289000e+03       1.000000  1.313870e+18      0.458800
 max    9.756922e+06  1.162364e+06     333.000000  1.315730e+18      0.987900)

**Handling Missing Data**

Check for missing values and decide how to handle them.

In [None]:
# Check for missing values in the dataset
missing_values = tweets_df.isnull().sum()
missing_values_percentage = (tweets_df.isnull().sum() / len(tweets_df)) * 100

missing_data = pd.DataFrame({'Total Missing': missing_values, 'Percentage': missing_values_percentage})
missing_data.sort_values(by='Total Missing', ascending=False)


Unnamed: 0,Total Missing,Percentage
location,3382,23.612372
created_at,0,0.0
file_name,0,0.0
followers,0,0.0
friends,0,0.0
group_name,0,0.0
retweet_count,0,0.0
screenname,0,0.0
search_query,0,0.0
text,0,0.0


The only column with missing data is location: 3,382 of 14,323 values (about 23.61%) are missing. Because the text analysis does not use location, we can proceed without imputing or dropping these rows.
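
If a later step did need location (for example, grouping tweets by place), one option is to label the gaps explicitly instead of dropping rows. A minimal sketch on a made-up frame; the placeholder "Unknown" and the location_filled column are assumptions for illustration:

```python
import pandas as pd

# Made-up rows; None marks a missing location
sample = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "location": ["sydney", None, "PARIS", None],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100

# Keep every row; mark missing locations with an explicit placeholder
sample["location_filled"] = sample["location"].fillna("Unknown")

print(missing_pct["location"])
print(sample["location_filled"].tolist())
```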

**Unique Values**

In [None]:
import numpy as np
# Define the function to calculate unique values
def unique_values(data):
    """Return the non-null count and number of distinct values per column."""
    tt = pd.DataFrame({'Total': data.count()})
    tt['Uniques'] = data.nunique()  # Series indexed by column name, aligned with tt
    return tt.T

# Calculate unique values
unique_values_result = unique_values(tweets_df)
unique_values_result

Unnamed: 0,created_at,file_name,followers,friends,group_name,location,retweet_count,screenname,search_query,text,twitter_id,username,polarity,partition_0,partition_1,cleaned_text
Total,14323,14323,14323,14323,14323,10941,14323,14323,14323,14323,14323,14323,14323,14323,14323,14323
Uniques,5331,5,3409,3035,5,3194,34,8518,5,14145,401,8495,1636,1,5,11431
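
The Uniques row reports only 401 distinct twitter_id values across 14,323 rows. The column is loaded as float64 and displayed as 1.31304e+18, which suggests the IDs lost digits before or during loading; that is an inference from the output, not something the source states. Independently of the cause, float64 cannot represent every 19-digit integer, as this sketch with made-up IDs shows:

```python
# Two made-up 19-digit IDs that differ only in the last digit
id_a = 1313040000000000001
id_b = 1313040000000000002

# As Python ints the two IDs are distinct
ints_equal = id_a == id_b

# float64 has a 53-bit significand, so neighbouring integers of this size collide
floats_equal = float(id_a) == float(id_b)

# Keeping IDs as strings preserves every digit
strings_equal = str(id_a) == str(id_b)

print(ints_equal, floats_equal, strings_equal)
```

If the raw CSV stores the full IDs, passing `dtype={'twitter_id': str}` to `pd.read_csv` would keep them intact; whether the file does is not shown here.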


**Most frequent values and Date conversion**

In [None]:
# Define the function to find most frequent values
def most_frequent_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    items = []
    vals = []
    for col in data.columns:
        # Compute value_counts once per column; its first entry is the most frequent value
        counts = data[col].value_counts()
        items.append(counts.index[0])
        vals.append(counts.values[0])
    tt['Most frequent item'] = items
    tt['Frequence'] = vals
    tt['Percent from total'] = np.round(vals / total * 100, 3)
    return tt.T

# Find most frequent values
most_frequent_values_result = most_frequent_values(tweets_df)

# Convert the 'created_at' column to a date format
tweets_df['date_new'] = pd.to_datetime(tweets_df['created_at']).dt.date

# Drop duplicate tweets based on the 'text' column
tweets_df_deduped = tweets_df.drop_duplicates('text')

# Display the number of unique dates
unique_dates_count = len(tweets_df_deduped['date_new'].unique())

# Set the maximum display width for DataFrame columns
pd.set_option('display.max_colwidth', 700)

most_frequent_values_result, tweets_df_deduped.head(), unique_dates_count

(                         created_at file_name followers friends group_name  \
 Total                         14323     14323     14323   14323      14323   
 Most frequent item  10/5/2020 21:00   Youtube     40559   12841    Youtube   
 Frequence                        31      6452       399     399       6452   
 Percent from total            0.216    45.046     2.786   2.786     45.046   
 
                                            location retweet_count screenname  \
 Total                                         10941         14323      14323   
 Most frequent item  ðŸŒ - ðŸ‡ºðŸ‡¸ - âœ¶âœ¶âœ¶âœ¶              0     iammab   
 Frequence                                      1362          9174       1362   
 Percent from total                           12.449        64.051      9.509   
 
                    search_query  \
 Total                     14323   
 Most frequent item     #Youtube   
 Frequence                  6452   
 Percent from total       45.046   
 
              

**Drop duplicate tweets based on the 'text' column**

In [None]:
# Drop duplicate tweets based on the 'text' column
tweets_df_deduped = tweets_df.drop_duplicates('text')

# Display the new shape of the dataset
new_shape = tweets_df_deduped.shape

# Explore the 'source' column equivalent in this dataset
# Assuming 'file_name' or 'search_query' as the equivalent of 'source'
source_counts_file_name = tweets_df_deduped['file_name'].value_counts()
source_counts_search_query = tweets_df_deduped['search_query'].value_counts()

new_shape, source_counts_file_name, source_counts_search_query



((14145, 17),
 Youtube      6368
 Microsoft    2914
 Amazon       2820
 Nvidia       1056
 Google        987
 Name: file_name, dtype: int64,
 #Youtube           6368
 #Microsoft         2914
 #Amazon OR #AWS    2820
 #Nvidia            1056
 #Google             987
 Name: search_query, dtype: int64)
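
drop_duplicates('text') removes exact matches only. Rows 1 and 4 of the first preview carry the same message and differ only in their t.co links, so near-duplicates like these survive. A sketch that deduplicates on a URL-stripped, lowercased key; the dedup_key column and the sample texts are illustrative assumptions:

```python
import pandas as pd

# Made-up tweets: the first two differ only in their shortened URL
sample = pd.DataFrame({
    "text": [
        "New GPU announced https://t.co/aaa111",
        "New GPU announced https://t.co/bbb222",
        "Completely different tweet",
    ]
})

# Exact-match deduplication keeps all three rows
exact = sample.drop_duplicates("text")

# Build a normalised key: strip URLs, collapse whitespace, lowercase
sample["dedup_key"] = (
    sample["text"]
    .str.replace(r"http\S+", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.lower()
)
near = sample.drop_duplicates("dedup_key")

print(len(exact), len(near))
```

Whether collapsing such rows is desirable depends on the analysis: reposts of the same message may still be separate signals of sentiment.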

**Text Cleaning**

* Lowercasing all the text.
* Removing special characters, URLs, and numbers.
* Handling or removing emojis.
* Expanding contractions (e.g., converting "don't" to "do not").
* Removing stop words.
* Stemming or Lemmatization.
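
The cells that follow cover URL, handle, hashtag, and special-character removal. For some of the remaining items in the list (lowercasing, contraction expansion, stop-word removal), a minimal standard-library sketch might look like this. The contraction map and stop-word set are tiny hand-written samples chosen for illustration, not a complete resource, and normalize is a hypothetical helper name:

```python
import re

# Tiny illustrative lookups (assumptions, deliberately incomplete)
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}
STOP_WORDS = {"a", "an", "the", "to", "is", "am", "i"}

def normalize(text):
    """Lowercase, expand known contractions, strip punctuation, drop stop words."""
    text = text.lower()
    # Normalise curly apostrophes so the lookup keys match
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep letters and digits only, then tokenise on whitespace
    tokens = re.sub(r"[^0-9a-z]+", " ", text).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(normalize("I'm going to attend GTC 2020 tonight! Excited."))
```

Expanding contractions before stripping punctuation matters: otherwise "I'm" degrades to "Im" or "I m", as the cleaned outputs further down show.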

**Install neattext library**

In [None]:
!pip install neattext

Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


**Clean the text data**

In [None]:
import neattext as ntx

# Function to clean the text data
def clean_text(text):
    text = ntx.remove_urls(text)  # Remove URLs
    text = ntx.remove_userhandles(text)  # Remove user handles
    text = ntx.remove_hashtags(text)  # Remove hashtags
    text = ntx.remove_special_characters(text)  # Remove special characters
    text = ntx.remove_multiple_spaces(text)  # Remove multiple spaces
    return text.strip()

# Apply the cleaning function to the text column
tweets_df['cleaned_text'] = tweets_df['text'].apply(clean_text)

# Display the cleaned text along with the original text for comparison
tweets_df[['text', 'cleaned_text']].head()

Unnamed: 0,text,cleaned_text
0,#NVIDIA #GauGAN is actually a good tool to pra...,is actually a good tool to practice compositio...
1,"#BullSequana X2415, the first #supercomputer b...",X2415 the first blade server in Europe to inte...
2,I'm going to attend GTC 2020 tonight! Excited....,Im going to attend GTC 2020 tonight Excited
3,#NVIDIA delays launch of #GeForce RTX 3070 unt...,delays launch of RTX 3070 until end of October
4,"#BullSequana X2415, the first #supercomputer b...",X2415 the first blade server in Europe to inte...


**Clean the text data using regular expressions**

In [None]:
import re

# Function to clean the text data using regular expressions
def clean_text_alternative(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"@\S+", "", text)  # Remove user handles
    text = re.sub(r"#\S+", "", text)  # Remove hashtags
    text = re.sub(r"[^0-9a-zA-Z]+", " ", text)  # Replace runs of special characters with a space
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces
    return text.strip()

# Apply the cleaning function to the text column
tweets_df['cleaned_text'] = tweets_df['text'].apply(clean_text_alternative)

# Display the cleaned text along with the original text for comparison
tweets_df[['text', 'cleaned_text']].head()

Unnamed: 0,text,cleaned_text
0,#NVIDIA #GauGAN is actually a good tool to practice composition/framing? https://t.co/CJyec1kNU1,is actually a good tool to practice composition framing
1,"#BullSequana X2415, the first #supercomputer blade server in Europe to integrate #NVIDIA’s Ampere next-generation graphics processing unit architecture, the NVIDIA A100 Tensor Core #GPU s. Read more ▶ https://t.co/5HXnY4IoJ5 Atos is sponsor at #Nvidia #GTC20 Digital https://t.co/KSd2CmfsUH",X2415 the first blade server in Europe to integrate Ampere next generation graphics processing unit architecture the NVIDIA A100 Tensor Core s Read more Atos is sponsor at Digital
2,I'm going to attend GTC 2020 tonight! Excited. #Nvidia #GTC20,I m going to attend GTC 2020 tonight Excited
3,#NVIDIA delays launch of #GeForce RTX 3070 until end of October https://t.co/5Jznv1aJkd https://t.co/uz0uGXbTQj,delays launch of RTX 3070 until end of October
4,"#BullSequana X2415, the first #supercomputer blade server in Europe to integrate #NVIDIA’s Ampere next-generation graphics processing unit architecture, the NVIDIA A100 Tensor Core #GPU s. Read more ▶ https://t.co/SvG04NXgsa Atos is sponsor at #Nvidia #GTC20 Digital https://t.co/gjVTFQiDwz",X2415 the first blade server in Europe to integrate Ampere next generation graphics processing unit architecture the NVIDIA A100 Tensor Core s Read more Atos is sponsor at Digital
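
Both cleaners delete hashtags entirely, which also deletes the word each one carries: in row 1, "#supercomputer blade server" becomes "blade server". If those words matter for the analysis, one alternative is to strip only the "#" symbol. A sketch of that variant; clean_text_keep_hashtags is a hypothetical name, not part of the notebook:

```python
import re

def clean_text_keep_hashtags(text):
    """Like the regex cleaner above, but keep the word behind each hashtag."""
    text = re.sub(r"http\S+", "", text)          # Remove URLs
    text = re.sub(r"@\S+", "", text)             # Remove user handles
    text = re.sub(r"#(\w+)", r"\1", text)        # Drop '#', keep the tag text
    text = re.sub(r"[^0-9a-zA-Z]+", " ", text)   # Replace special characters with a space
    return text.strip()

cleaned = clean_text_keep_hashtags(
    "#NVIDIA delays launch of #GeForce RTX 3070 until end of October https://t.co/5Jznv1aJkd"
)
print(cleaned)
```

The trade-off is that trailing tag lists such as "#Nvidia #GTC20" also stay in the text, which may add noise for some models.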
