In this step you write the code base needed to analyse the data and to build a machine learning model for topic modelling and sentiment analysis.

You are required to write the following modules at the minimum:

* Data exploration and pre-processing: extend the code you fixed in step 1 to read the data, pre-process it, and produce exploratory analysis and visualisations.
* Topic modelling and sentiment analysis: write code using scikit-learn, Gensim, or other packages and APIs to model the topics discussed in the tweets and their sentiment. Word clouds, k-means clustering, and similar techniques are acceptable as simple topic models.
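
A word cloud, mentioned above as a simple topic model, is essentially a picture of term frequencies. The sketch below shows that underlying count using only the standard library. The `top_terms` helper, the small `STOP_WORDS` set, and the sample tweets are all hypothetical and for illustration only; the notebook itself imports a fuller `STOPWORDS` set from the `wordcloud` package.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; the notebook imports a fuller
# STOPWORDS set from the wordcloud package.
STOP_WORDS = {"the", "a", "is", "to", "of", "and", "in", "for"}


def top_terms(texts, n=3):
    """Return the n most frequent non-stop-word tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample tweets, not taken from the dataset.
sample = [
    "Mobile banking is the future of fintech",
    "Fintech startups raise funding for mobile payments",
    "Banking apps and fintech in Kenya",
]
print(top_terms(sample))
```

The most frequent terms give a rough first impression of the topics present, which is why a frequency view is often a reasonable baseline before fitting a clustering or LDA model.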

In [1]:
# import libraries
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import STOPWORDS,WordCloud
import gensim
from gensim.models import CoherenceModel
from gensim import corpora
import pandas as pd
from pprint import pprint
import string
import os
import re

In [4]:
# read data into a dataframe
twitter_df = pd.read_csv('cleaned_fintech_data.csv')

In [5]:
twitter_df.head()

Unnamed: 0.1,Unnamed: 0,created_at,source,original_text,clean_text,sentiment,polarity,subjectivity,lang,favorite_count,...,original_author,screen_count,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place,place_coord_boundaries,timestamp
0,0.0,Thu Jun 17 06:26:34 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...",Giving forth life is becoming a burden in Keny...,Giving forth life becoming burden Kenya This m...,"Sentiment(polarity=0.3194444444444445, subject...",0.3194444444444445,0.5305555555555556,en,0,...,reen_law,398,70,223,,,janetmachuka_,,,2021-06-17 06:26:34+00:00
1,1.0,Thu Jun 17 06:26:37 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Teenmaar - 26cr\nPanja - 32.5cr\nGabbarsingh -...,Teenmaar crPanja crGabbarsingh cr Khaleja Kuda...,"Sentiment(polarity=0.0, subjectivity=0.0)",0.0,0.0,in,0,...,Amigo9999_,19047,132,1084,,,maheshblood,,India,2021-06-17 06:26:37+00:00
2,2.0,Thu Jun 17 06:26:42 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Rei chintu 2013 lo Vachina Ad Nizam ne 2018 lo...,Rei chintu lo Vachina Ad Nizam ne lo kottaru f...,"Sentiment(polarity=0.0, subjectivity=0.0)",0.0,0.0,hi,0,...,MallaSuhaas,47341,2696,2525,,,Hail_Kalyan,,Vizag,2021-06-17 06:26:42+00:00
3,3.0,Thu Jun 17 06:26:44 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...",Today is World Day to Combat #Desertification ...,Today World Day Combat Restoring degraded land...,"Sentiment(polarity=0.25, subjectivity=0.65)",0.25,0.65,en,0,...,CIACOceania,7039,343,387,,"Desertification, Drought, resilience",EdwardVrkic,,Papua New Guinea,2021-06-17 06:26:44+00:00
4,4.0,Thu Jun 17 06:26:47 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Hearing #GregHunt say he's confident vaccines ...,Hearing say 's confident vaccines delivered li...,"Sentiment(polarity=0.5, subjectivity=0.8333333...",0.5,0.8333333333333334,en,0,...,MccarronWendy,26064,419,878,,"GregHunt, Morrison",WriteWithDave,,"Sydney, New South Wales",2021-06-17 06:26:47+00:00
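
The `polarity` values shown above can be turned into coarse sentiment labels. The following is a minimal sketch under stated assumptions, not the notebook's own method: the `label_polarity` helper and its `threshold` parameter are hypothetical, and the function accepts strings as well as floats because the scores may have been read from the CSV as text.

```python
import math


def label_polarity(value, threshold=0.0):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return "unknown"  # non-numeric or missing entry
    if math.isnan(score):
        return "unknown"  # NaN marks a missing score
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print([label_polarity(p) for p in ["0.3194444444444445", "0.0", "-0.4", None]])
```

On the dataframe, the same function could be applied with `twitter_df['polarity'].apply(label_polarity)`. A threshold above zero widens the neutral band, which may help when many scores sit close to zero.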


In [19]:
# size of data
twitter_df.shape

(5621, 21)

In [6]:
# get information about the data
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5621 entries, 0 to 5620
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              5620 non-null   float64
 1   created_at              5621 non-null   object 
 2   source                  5621 non-null   object 
 3   original_text           5621 non-null   object 
 4   clean_text              5617 non-null   object 
 5   sentiment               5621 non-null   object 
 6   polarity                5621 non-null   object 
 7   subjectivity            5621 non-null   object 
 8   lang                    5621 non-null   object 
 9   favorite_count          5621 non-null   object 
 10  retweet_count           5621 non-null   object 
 11  original_author         5621 non-null   object 
 12  screen_count            5621 non-null   object 
 13  followers_count         5621 non-null   object 
 14  friends_count           5621 non-null   

In [7]:
# check for missing values
print("The number of missing value(s): {}".format(twitter_df.isnull().sum().sum()))

The number of missing value(s): 17941


In [16]:
# define a function that summarises missing values in a dataframe
def missing(data):
    '''Return the missing-value count and percentage for each column.'''
    count_missing = data.isnull().sum()  # number of missing values per column
    # percentage of missing values per column, rounded to the nearest whole number
    count_missing_percentage = round(count_missing * 100 / len(data))
    # build a dataframe of missing counts and percentages, indexed by column name
    missing_df = pd.DataFrame({'Missing Count': count_missing,
                               '%Missing': count_missing_percentage})
    missing_df.index.name = 'ColumnName'
    return missing_df

In [17]:
# use function on dataframe
missing(twitter_df)

ColumnName,Missing Count,%Missing
Unnamed: 0,1,0.0
created_at,0,0.0
source,0,0.0
original_text,0,0.0
clean_text,4,0.0
sentiment,0,0.0
polarity,0,0.0
subjectivity,0,0.0
lang,0,0.0
favorite_count,0,0.0


In [None]:
class PrepareData:
    def __init__(self, df):
        self.df = df

    def preprocess_data(self):
        # keep English tweets only; copy so the edits below do not touch a slice of self.df
        tweets_df = self.df.loc[self.df['lang'] == "en"].copy()

        # text preprocessing: cast to string, lowercase, strip punctuation
        tweets_df['clean_text'] = tweets_df['clean_text'].astype(str)
        tweets_df['clean_text'] = tweets_df['clean_text'].apply(lambda x: x.lower())
        tweets_df['clean_text'] = tweets_df['clean_text'].apply(
            lambda x: x.translate(str.maketrans('', '', string.punctuation)))

        # convert tweets to lists of words for feature engineering
        sentence_list = [tweet for tweet in tweets_df['clean_text']]
        word_list = [sent.split() for sent in sentence_list]

        # create a dictionary mapping each word to an id, then a bag-of-words corpus
        word_to_id = corpora.Dictionary(word_list)
        corpus_1 = [word_to_id.doc2bow(tweet) for tweet in word_list]

        return word_list, word_to_id, corpus_1

In [None]:
PrepareData_obj=PrepareData(twitter_df)
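
To make the output of `preprocess_data` easier to read, here is a standard-library sketch of the bag-of-words encoding that `corpora.Dictionary` and `doc2bow` produce. It is an illustration, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the documents are made up, and the ids Gensim assigns to tokens may differ from the first-seen order used here.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Encode one tokenised document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Made-up tokenised documents, standing in for word_list.
docs = [["fintech", "mobile", "fintech"], ["mobile", "banking"]]
vocab = build_vocab(docs)
corpus = [to_bow(d, vocab) for d in docs]
print(vocab)
print(corpus)
```

Each document becomes a sparse list of `(token_id, count)` pairs, the same shape as the entries of `corpus_1`, which is the input format Gensim's topic models expect.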

In [21]:
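# drop rows with no clean_text; dropna() returns a new dataframe, so assign the result
# (restricting to clean_text avoids discarding rows that only lack sparse optional columns)
twitter_df = twitter_df.dropna(subset=['clean_text'])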
twitter_df.head()

Unnamed: 0.1,Unnamed: 0,created_at,source,original_text,clean_text,sentiment,polarity,subjectivity,lang,favorite_count,...,original_author,screen_count,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place,place_coord_boundaries,timestamp
