# Analysis Details

**What is Topic Modelling?**

Topic modelling is a method for finding groups of words (topics) that best represent the information in a collection of text documents. It can also be thought of as a form of text mining: a way to find recurring patterns of words in textual data. The topics identified can help a business decide where to focus its efforts when improving its products or services.

**Project description**

In this project, I will use K-means to cluster customer reviews from Twitter data, with the aim of identifying the main topics in the tweets.

# NLTK

Human language is a highly unstructured form of data. The Natural Language Toolkit (NLTK) is a library that provides preprocessing and modelling tools for text data, including classification, tokenization, stemming, and tagging.

# Loading and Exploring Data

In [1]:
import nltk
import numpy as np
import re  # regular expressions
import pandas as pd 

%matplotlib inline

In [2]:
## Get multiple outputs in the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

In [3]:
raw_data = pd.read_csv("tweets.csv", encoding="ISO-8859-1")
print(len(raw_data))

21047


In [4]:
df = raw_data
df.head()

Unnamed: 0,username,date,tweet,mentions
0,shivaji_takey,10-06-2020,Please check what happens to this no 940417705...,['vodafonein']
1,sarasberiwala,10-06-2020,Network fluctuations and 4G Speed is pathetic....,['vodafonein']
2,chitreamod,10-06-2020,This has been going on since 3rd... this absol...,['vodafonein']
3,sanjan_suman,10-06-2020,@VodafoneIN I have done my recharge of 555 on...,['vodafonein']
4,t_nihsit,10-06-2020,But when???Still I am not received any call fr...,['vodafonein']


In [5]:
# check the number of unique tweets
unique_text = df.tweet.unique()
print(len(unique_text))

21047


In [6]:
df['tweet'][444]

'Can you share me good plan and can tell me how can i port my network operator'

So there are 21,047 tweets about Vodafone, all with unique text.

# Cleaning

Text preprocessing differs from classical numerical preprocessing, but it is at least as important. Common preprocessing tasks are:

- Lowercase
- Dealing with numbers and punctuation
- Removing "stopwords"
- Tokenizing
- Stemming or Lemmatizing
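
As an illustration of the stopword step in the list above, here is a minimal sketch in plain Python. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and deliberately tiny; in practice a fuller list such as `nltk.corpus.stopwords.words('english')` (after downloading the `stopwords` corpus) would probably be a better starting point.

```python
# Hypothetical, deliberately small stopword list used only for illustration
STOPWORDS = {"the", "and", "this", "what", "but", "have", "has", "been", "any", "from"}

def remove_stopwords(tokens):
    """Return the tokens that are not in STOPWORDS, preserving their order."""
    return [t for t in tokens if t not in STOPWORDS]

sample = "network fluctuations and speed pathetic".split()
filtered = remove_stopwords(sample)
print(filtered)
```

Negations such as "not" are left out of the set on purpose here, since in complaint data they may carry meaning; whether to remove them is a judgment call.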

In [7]:
def remove_pattern(input_txt, pattern):
    # Delete every substring of input_txt that matches the regex pattern
    return re.sub(pattern, '', input_txt)
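
A note on substituting matched text: a string returned by `re.findall` is literal text, not a pattern, so passing it as the first argument of `re.sub` is only safe when it contains no regex metacharacters. The sketch below uses a hypothetical sample string and pattern (not taken from the dataset) to show the difference between reusing a match, escaping it with `re.escape`, and substituting the original pattern directly.

```python
import re

text = "refund of $5 still pending"   # hypothetical sample tweet
pattern = r"\$\d+"                     # a dollar sign followed by digits

matches = re.findall(pattern, text)    # the literal strings that matched

# Reusing the match as a pattern fails: an unescaped "$" is the end-of-string anchor
naive = re.sub(matches[0], "", text)

# Escaping the match, or substituting the original pattern, removes the text
escaped = re.sub(re.escape(matches[0]), "", text)
direct = re.sub(pattern, "", text)

print(repr(naive))
print(repr(escaped))
print(repr(direct))
```

For mentions of the form `@` plus word characters the issue does not arise, because such matches contain no metacharacters, but substituting the pattern directly avoids having to reason about it.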

## Remove any @ mentions

Remove @ mentions (user handles such as @VodafoneIN) from all of the tweets so that the model does not treat them as content words.

In [8]:
df['Clean_text'] = np.vectorize(remove_pattern)(df['tweet'], r"@\w*")
df['Clean_text'].head()

0    Please check what happens to this no 940417705...
1    Network fluctuations and 4G Speed is pathetic....
2    This has been going on since 3rd... this absol...
3      I have done my recharge of 555 on 9709333370...
4    But when???Still I am not received any call fr...
Name: Clean_text, dtype: object

## Remove numbers and punctuation

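Punctuation is rarely used consistently in modern text forms such as social media. Unless you can guarantee proper use of punctuation across the entire dataset, it is better to remove it. The pattern below replaces every character that is not a letter or `#` with a space, so digits are removed as well.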

In [9]:
df['Clean_text'] = df['Clean_text'].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True is required in pandas 2.0+
df['Clean_text'].head()

0    Please check what happens to this no          ...
1    Network fluctuations and  G Speed is pathetic ...
2    This has been going on since  rd    this absol...
3      I have done my recharge of     on           ...
4    But when   Still I am not received any call fr...
Name: Clean_text, dtype: object

## Lower case

Tokens are compared as exact strings, so two words need the same casing to be counted as the same word.

In [10]:
df['Clean_text'] = df['Clean_text'].str.lower()
df['Clean_text']

0        please check what happens to this no          ...
1        network fluctuations and  g speed is pathetic ...
2        this has been going on since  rd    this absol...
3          i have done my recharge of     on           ...
4        but when   still i am not received any call fr...
                               ...                        
21042    i sent u my contact no  but still did not get ...
21043    dear   i have bn facing ur network problem for...
21044    rubbish i made many time   you didn t resolved...
21045    why the caller tunes sound so horrible  if a s...
21046      what nonsense are u guys saying    i m getti...
Name: Clean_text, Length: 21047, dtype: object
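
To see why casing matters for counting, here is a small sketch using `collections.Counter` on a hypothetical token list (not taken from the dataset). Before lowercasing, each casing of the same word is counted separately.

```python
from collections import Counter

# Hypothetical sample: the same word in three different casings
tokens = "Network down again network NETWORK slow".split()

raw_counts = Counter(tokens)                        # exact-string counts
lower_counts = Counter(t.lower() for t in tokens)   # counts after lowercasing

print(raw_counts["network"])    # only the exact lowercase spelling
print(lower_counts["network"])  # all three casings combined
```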

## Remove whitespace and words of length 1 or 2

In [11]:
df['Clean_text'] = df['Clean_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
df['Clean_text'].head()

0    please check what happens this not woking sinc...
1    network fluctuations and speed pathetic need j...
2    this has been going since this absolutely unpr...
3    have done recharge but haven got perday with u...
4    but when still not received any call from cust...
Name: Clean_text, dtype: object

## Tokenizing

Tokenizing means transforming a single string into a list of words, also called word tokens. For preprocessing tasks dealing with entire words, you will need to tokenize your text.

In [12]:
df['Clean_text'] = df['Clean_text'].apply(lambda x: x.split())
df['Clean_text'].head()

0    [please, check, what, happens, this, not, woki...
1    [network, fluctuations, and, speed, pathetic, ...
2    [this, has, been, going, since, this, absolute...
3    [have, done, recharge, but, haven, got, perday...
4    [but, when, still, not, received, any, call, f...
Name: Clean_text, dtype: object

## Lemmatizing

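Lemmatizing reduces each word to its dictionary form (lemma), so that inflected variants such as "fluctuations" and "fluctuation" are grouped together. `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, which is why most verbs are left unchanged and "has" becomes "ha" in the output below.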

In [13]:
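from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()  # create once and reuse for every tweet

def lemmatize_text(tokens):
    # Lemmatize each token; with no POS tag, every token is treated as a noun
    return [lemmatizer.lemmatize(word) for word in tokens]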

In [14]:
df['Clean_text'] = df['Clean_text'].apply(lemmatize_text)

In [15]:
df['Clean_text'].head()

0    [please, check, what, happens, this, not, woki...
1    [network, fluctuation, and, speed, pathetic, n...
2    [this, ha, been, going, since, this, absolutel...
3    [have, done, recharge, but, haven, got, perday...
4    [but, when, still, not, received, any, call, f...
Name: Clean_text, dtype: object

In [16]:
# Join each token list back into a single space-separated string
df['Clean_text'] = df['Clean_text'].apply(' '.join)

## Removing duplicates

In [17]:
df.shape

(21047, 5)

In [21]:
# Remove tweets whose cleaned text exactly duplicates an earlier tweet (the first occurrence is kept)
df.drop_duplicates(subset=['Clean_text'], keep = 'first',inplace= True)

In [22]:
# after dropping duplicates, I reset the index
df.reset_index(drop=True,inplace=True)

In [23]:
df.shape

(19754, 5)
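
All 21,047 raw tweets were unique, yet 1,293 rows are dropped here (21,047 minus 19,754). A plausible explanation is that cleaning maps different raw tweets onto the same string. The sketch below illustrates this with a hypothetical `clean` helper that condenses the steps above (without lemmatizing) and three made-up tweets; `dict.fromkeys` stands in for `drop_duplicates(keep='first')`.

```python
import re

def clean(text):
    """Condensed, simplified version of the cleaning steps for one string."""
    text = re.sub(r"@\w*", "", text)           # drop @ mentions
    text = re.sub(r"[^a-zA-Z#]", " ", text)    # keep only letters and '#'
    text = text.lower()
    return " ".join(w for w in text.split() if len(w) > 2)  # drop 1-2 letter words

# Hypothetical raw tweets: all distinct before cleaning
raw = [
    "@VodafoneIN Network is down!!",
    "@VodafoneIN network is down",
    "Network is down... 9876543210",
]
cleaned = [clean(t) for t in raw]
unique_cleaned = list(dict.fromkeys(cleaned))  # keeps the first occurrence of each string

print(len(set(raw)), len(unique_cleaned))
```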

## Identifying special instances of tweets

In [26]:
# Compute the length of each cleaned tweet so that empty ones can be found
df['Clean_text_length'] = df['Clean_text'].apply(len)

In [27]:
df[df['Clean_text_length']==0]

Unnamed: 0,username,date,tweet,mentions,Clean_text,Clean_text_length
20,omanmessi,10-06-2020,@VodafoneIN,"['ooredoooman', 'vodafonein']",,0


In [28]:
# Check the original tweet: it contains only a mention, so the empty Clean_text is genuine
# Note: df = raw_data does not copy, so raw_data shows the same processed frame as df
raw_data[raw_data['username']=='omanmessi']

Unnamed: 0,username,date,tweet,mentions,Clean_text,Clean_text_length
20,omanmessi,10-06-2020,@VodafoneIN,"['ooredoooman', 'vodafonein']",,0


In [31]:
# Tweets whose cleaned text is empty carry no topic information, so we can simply drop them
indexes = df[df['Clean_text_length']==0]['Clean_text'].index
indexes

Int64Index([20], dtype='int64')

In [32]:
df.drop(index = indexes,inplace=True)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19753 entries, 0 to 19753
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   username           19753 non-null  object
 1   date               19753 non-null  object
 2   tweet              19753 non-null  object
 3   mentions           19753 non-null  object
 4   Clean_text         19753 non-null  object
 5   Clean_text_length  19753 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.1+ MB


In [34]:
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19753 entries, 0 to 19752
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   username           19753 non-null  object
 1   date               19753 non-null  object
 2   tweet              19753 non-null  object
 3   mentions           19753 non-null  object
 4   Clean_text         19753 non-null  object
 5   Clean_text_length  19753 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 926.0+ KB


In [35]:
df['Clean_text'].head()

0    please check what happens this not woking sinc...
1    network fluctuation and speed pathetic need ji...
2    this ha been going since this absolutely unpro...
3    have done recharge but haven got perday with u...
4    but when still not received any call from cust...
Name: Clean_text, dtype: object

# Vectorizer