## Introduction

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| Import, and briefly discuss, the libraries that will be used throughout the analysis and modelling. |

---

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| Load the data from the `train.csv` and `test.csv` files into DataFrames. |

---

In [None]:
import pandas as pd

# Loading the training dataset
df_train = pd.read_csv("train.csv")

# Loading the testing dataset
df_test = pd.read_csv("test.csv")



<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| Perform an in-depth analysis of all the variables in the DataFrame. |

---


#### Reading the dataset

In [None]:
print(df_train.shape)
print(df_test.shape)

print(df_train.head(10), "\n")
print(df_test.head(10))

In [None]:
# Displays summary statistics for the numeric columns of the test set
df_test.describe()

In [None]:
# Displays info about the columns
df_train.info()


#### Converting the sentiments from numbers to words

In [None]:
# Convert sentiments from numbers to words
def update(df):
    """Return a copy of `df` with the numeric sentiment codes replaced by words."""
    # Copy the DataFrame that was passed in, so the caller's data is not modified
    df = df.copy()
    labels = {1: 'Pro', 0: 'Neutral', -1: 'Anti'}

    # Any code other than 1, 0, or -1 is labelled 'News'
    df['sentiment'] = [labels.get(code, 'News') for code in df['sentiment']]

    return df

df_train = update(df_train)
df_train.head()

#### Check for duplicates

In [None]:
duplicate_tweets = round((1-(df_train['message'].nunique()/len(df_train['message'])))*100,2)
print('Duplicate tweets percentage:', duplicate_tweets,'%')

The duplicates are caused by retweets, so about 10.5% of the tweets in our dataset are retweets (RT).
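
The retweet explanation can be checked directly. The sketch below is only an illustration on made-up messages (the `messages` Series is a stand-in for `df_train['message']`), and it assumes retweets follow the usual `RT @user` prefix convention.

```python
import pandas as pd

# Toy stand-in for df_train['message']; the real column comes from train.csv.
messages = pd.Series([
    "RT @user: climate change is real",
    "RT @user: climate change is real",
    "Just read a report on rising sea levels",
    "RT @other: new climate policy announced",
    "Just read a report on rising sea levels",
])

# Rows whose text repeats an earlier row.
is_duplicate = messages.duplicated(keep="first")

# Rows that follow the assumed "RT @user" retweet convention.
is_retweet = messages.str.startswith("RT @")

duplicate_pct = round(is_duplicate.mean() * 100, 2)
retweet_share = round((is_duplicate & is_retweet).sum() / is_duplicate.sum() * 100, 2)

print("Duplicate tweets percentage:", duplicate_pct, "%")
print("Share of duplicates that are retweets:", retweet_share, "%")
```

If the second figure is well below 100% on the real data, some duplicates come from sources other than retweets.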

#### Check number of tweets for each sentiment class

In [None]:
df_train['sentiment'].value_counts()

#### Check distribution of sentiments

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='sentiment', data=df_train)
plt.title('Distribution of Sentiments')
plt.show()


#### Proportion of tweets in each sentiment

In [None]:
# Plot the proportion of tweets in each sentiment
perc = df_train['sentiment'].value_counts()
perc.plot(kind='pie', autopct='%1.1f%%')
plt.title('Proportion of tweets in each sentiment')

plt.show()

The proportion of tweets in each sentiment shows that the Pro climate change class is the majority.
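
Because one class dominates, it can help to see how a "balanced" class weighting would compensate. The sketch below uses hypothetical label counts (not the real ones) and the common heuristic `n_samples / (n_classes * class_count)`, which is also the formula behind scikit-learn's `class_weight='balanced'` option.

```python
from collections import Counter

# Hypothetical labels, for illustration only; the real counts come from
# df_train['sentiment'].value_counts().
labels = ["Pro"] * 6 + ["News"] * 2 + ["Neutral"] * 1 + ["Anti"] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Share of each class in the data.
proportions = {label: count / n_samples for label, count in counts.items()}

# Balanced weighting: rare classes get weights above 1, common classes below 1.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label in counts:
    print(f"{label}: {proportions[label]:.0%} of tweets, weight {balanced_weights[label]:.2f}")
```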

#### Tweet Data Analysis

1. Average length of tweets

In [None]:
# Compute the length of each tweet in characters and report the average
df_train['tweet_length'] = df_train['message'].apply(len)
print('Average tweet length:', round(df_train['tweet_length'].mean(), 2))

2. Distribution of tweet length

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df_train['tweet_length'], bins=50, kde=True)
plt.title('Distribution of Tweet Lengths')
plt.xlabel('Tweet Length')
plt.show()

In [None]:
# Plot the distribution of tweet lengths for each class using a box plot
sns.boxplot(x='sentiment', y='tweet_length', data=df_train)
plt.title('Tweet length for each class')
plt.show()

3. Common words and phrases

In [None]:
from nltk import FreqDist, bigrams
from nltk.tokenize import word_tokenize  # requires NLTK's tokenizer data to be downloaded

# Concatenate all the tweets into a single string, tokenise the text into
# individual words, and output the most common words and their frequencies
all_text = ' '.join(df_train['message'].astype(str))

tokens = word_tokenize(all_text)
fdist = FreqDist(tokens)
common_words = fdist.most_common(10)
print("Common Words:", common_words)

# Repeat the frequency count for pairs of adjacent tokens (bigrams)
bi_grams = list(bigrams(tokens))
bi_gram_freq = FreqDist(bi_grams)
common_bigrams = bi_gram_freq.most_common(10)
print("Common Bigrams:", common_bigrams)
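
To make the counting step concrete, here is a standard-library sketch of the same idea on a made-up sentence. It uses a plain whitespace split instead of `word_tokenize`, so the tokens will differ from NLTK's on real tweets.

```python
from collections import Counter

# Made-up text; whitespace splitting stands in for NLTK tokenisation.
tokens_demo = "climate change is real and climate change is urgent".split()

# Frequency of each individual word.
word_counts = Counter(tokens_demo)

# Pair each token with its successor to form bigrams, then count them.
bigram_counts = Counter(zip(tokens_demo, tokens_demo[1:]))

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
```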

4. WordCloud

In [None]:
from wordcloud import WordCloud

tweets = df_train['message'].values

if len(tweets) > 0:
    # Build one word cloud from the text of all tweets
    all_tweets_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(tweets))
    plt.figure(figsize=(10, 6))
    plt.imshow(all_tweets_wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('WordCloud for all tweets')
    plt.show()
else:
    print('No tweets found.')


5. Check for null values

In [None]:
# Print both results; a notebook cell only displays its last expression
print(df_train.isnull().sum())
print(df_test.isnull().sum())

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| Clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
df_train.head()

In [None]:
# Check for missing values
print("Missing values in df_train:\n", df_train.isnull().sum())
print("Missing values in df_test:\n", df_test.isnull().sum())

In [None]:
df_test.info()

In [None]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Function to clean the message column
def clean_text(text):
    # Remove mentions (@user), URLs, and special characters
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Remove mentions
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^A-Za-z\s]', '', text)  # Remove special characters

    text = text.lower()

    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_text = [word for word in tokens if word not in stop_words]

    # Join the filtered words back into a sentence
    text = ' '.join(filtered_text)
    return text
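
To see what each cleaning step does, the sketch below runs the same regular expressions on a made-up tweet. It is a standard-library approximation: `clean_text_sketch`, the tiny `STOP_WORDS` set, and the whitespace split are illustrative stand-ins for `clean_text`, NLTK's stop-word list, and `word_tokenize`.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "a", "of", "to", "and"}

def clean_text_sketch(text):
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # remove mentions
    text = re.sub(r'http\S+', '', text)         # remove URLs
    text = re.sub(r'[^A-Za-z\s]', '', text)     # keep letters and whitespace only
    tokens = text.lower().split()               # whitespace split stands in for word_tokenize
    return ' '.join(word for word in tokens if word not in STOP_WORDS)

example = "RT @user: The climate is changing! Read more at https://t.co/abc123 #ClimateChange"
print(clean_text_sketch(example))
```

Note that the hashtag text survives as a single lowercase token, while the mention, the URL, and the punctuation are dropped.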



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Splitting out the features (X) from the target (y)
y = df_train['sentiment']
X = df_train['message']

# Turning the text into TF-IDF features (unigrams and bigrams) that a model can read
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, stop_words="english")
X_vectorized = vectorizer.fit_transform(X)


<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| Create one or more classification models that are able to accurately predict the sentiment class of each tweet. |

---

In [None]:
# splitting the training data into a training set and a validation set
from sklearn.model_selection import train_test_split


X_train,X_val,y_train,y_val = train_test_split(X_vectorized,y,test_size=.3,shuffle=True, stratify=y, random_state=11)

In [None]:
# Random forest: train on the training set and predict on the validation set
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_val)

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_val)

In [None]:
# Linear SVC
from sklearn.svm import LinearSVC

lsvc = LinearSVC(class_weight='balanced')
lsvc.fit(X_train, y_train)
lsvc_pred = lsvc.predict(X_val)

In [None]:
# K-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_val)

In [None]:
# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_val)

In [None]:
# Checking the performance of the KNN model on the validation set (macro-averaged F1)
from sklearn.metrics import f1_score

f1_score(y_val, knn_pred, average="macro")

In [None]:
# Getting our test ready

testx = df_test['message']
test_vect = vectorizer.transform(testx)

In [None]:
# Making predictions on the test set with the random forest model
y_pred = rfc.predict(test_vect)

In [None]:
y_pred

In [None]:
# Adding the predicted sentiment as a column on the original test DataFrame
df_test['sentiment'] = y_pred
df_test.head()

In [None]:
df_test[['tweetid', 'sentiment']].to_csv('submission.csv', index=False)

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| Compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
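
One possible way to run the comparison is to score every candidate on the same holdout split with the same metric. The sketch below is an assumption-laden illustration: the tweets, labels, and the `candidates` and `scores` dictionaries are all made up for the example, and it uses macro-averaged F1 so that minority classes count as much as the majority class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up tweets and labels, for illustration only.
texts = [
    "climate change is real and urgent",
    "we must act on global warming now",
    "renewable energy helps fight climate change",
    "scientists agree the planet is warming",
    "climate change is a hoax",
    "global warming is fake news",
    "the climate scare is a scam",
    "warming claims are exaggerated lies",
] * 5
labels = (["Pro"] * 4 + ["Anti"] * 4) * 5

X_demo = TfidfVectorizer().fit_transform(texts)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, labels, test_size=0.3, stratify=labels, random_state=11
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(class_weight="balanced"),
    "MultinomialNB": MultinomialNB(),
}

# Fit each model on the training split and score it on the holdout split.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_va, model.predict(X_va), average="macro")

# Report the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: macro F1 = {score:.3f}")
```

On the real data, the same loop could be run over the models trained in the Modelling section, using `X_val` and `y_val` as the holdout set.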

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---