### The assignment is to identify a set of topics on the vote data. 

Requirements:
1. Preprocess the text (e.g., stopword removal, lemmatization, stemming)
2. Build a TF-IDF text representation
3. Run LDA
4. Identify the optimal number of topics
5. Show the top 10 words for each topic

## 0. Read in file

In [1]:
import pandas as pd 
dataset = pd.read_csv("dataset.csv") 

In [2]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5101 entries, 0 to 5100
Data columns (total 15 columns):
Title          5101 non-null object
Subtitle       2721 non-null object
Owner          3191 non-null object
Votes          2885 non-null float64
Last Update    2885 non-null object
Tags           1700 non-null object
Data Type      2885 non-null object
Size           2885 non-null object
License        2885 non-null object
Views          2874 non-null object
Download       2803 non-null object
Kernels        1276 non-null object
Topics         2243 non-null object
URL            2885 non-null object
Description    2874 non-null object
dtypes: float64(1), object(14)
memory usage: 597.9+ KB


In [3]:
dataset.head(5)

Unnamed: 0,Title,Subtitle,Owner,Votes,Last Update,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1233.0,05/11/2016,crime\nfinance,CSV,144 MB,ODbL,"440,221 views","52,793 downloads","1,778 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1035.0,24/10/2016,association football\neurope,SQLite,299 MB,ODbL,"393,924 views","46,025 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1018.0,28/09/2017,film,CSV,44 MB,Other,"444,535 views","61,705 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Human Resources Analytics,Why are our best and most experienced employee...,Ludovic Benistant,832.0,30/11/2016,employment,CSV,553 KB,CC4,"309,644 views","47,350 downloads","1,772 kernels",32 topics,https://www.kaggle.com/ludobenistant/hr-analytics,This dataset is simulated\nWhy are our best an...
4,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,785.0,19/07/2017,crime\nterrorism\ninternational relations,CSV,144 MB,Other,"186,621 views","26,091 downloads",609 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."


In [4]:
dataset.tail(5)

Unnamed: 0,Title,Subtitle,Owner,Votes,Last Update,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
5096,House Data,,Varun Kashyap.K.S.,1.0,02/11/2017,,CSV,2 MB,Other,67 views,12 downloads,,0 topics,https://www.kaggle.com/varunkashyapks/house-data,This dataset does not have a description yet.
5097,Super Market Product,,Varun Kashyap.K.S.,1.0,03/11/2017,,Other,510 MB,Other,141 views,35 downloads,,0 topics,https://www.kaggle.com/varunkashyapks/super-ma...,This dataset does not have a description yet.
5098,Super Market Products,,Varun Kashyap.K.S.,1.0,03/11/2017,,CSV,179 MB,Other,138 views,8 downloads,,0 topics,https://www.kaggle.com/varunkashyapks/super-ma...,This dataset does not have a description yet.
5099,Sales Data,,Kamau John,1.0,02/11/2017,,CSV,988 B,CC0,372 views,38 downloads,,0 topics,https://www.kaggle.com/sophicist/sales-data,This dataset does not have a description yet.
5100,Titanic,,Marcel,1.0,03/11/2017,,CSV,88 KB,Other,337 views,57 downloads,,0 topics,https://www.kaggle.com/mkempers/titanic,This dataset does not have a description yet.


### Approach for this assignment: After inspecting the file, the author decided to use the Title, Subtitle, and Tags columns for topic modeling.

In [5]:
topics = pd.read_csv("dataset.csv", delimiter=';', names = ['Title', 'Subtitle', 'Tags'])

In [6]:
topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44183 entries, 0 to 44182
Data columns (total 3 columns):
Title       44183 non-null object
Subtitle    442 non-null object
Tags        80 non-null object
dtypes: object(3)
memory usage: 1.0+ MB
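
The `info()` output above reports 44,183 rows and almost no non-null `Subtitle` or `Tags` values, whereas the first read of the same file found 5,101 rows. One plausible cause is that the file is comma-separated, so `delimiter=';'` keeps pandas from recognizing the column boundaries and the quoted multi-line fields. Below is a minimal sketch of an alternative, assuming a comma-separated file: `usecols` keeps only the three columns, and missing values are replaced by empty strings before the columns are joined into one text per row. The inline sample rows and the `text` column name are illustrative and do not come from the original notebook.

```python
import io
import pandas as pd

# Tiny stand-in for dataset.csv (illustrative rows, comma-separated like the original)
sample_csv = io.StringIO(
    "Title,Subtitle,Owner,Tags\n"
    "Credit Card Fraud Detection,Anonymized transactions,ULB,crime\n"
    "House Data,,Someone,\n"
)

# usecols selects columns by header name; the default delimiter is a comma
topics = pd.read_csv(sample_csv, usecols=["Title", "Subtitle", "Tags"])

# Missing subtitles and tags become "" so they do not turn into the token "nan"
topics = topics.fillna("")

# One document per row: title, subtitle and tags joined by spaces
topics["text"] = (
    topics["Title"] + " " + topics["Subtitle"] + " " + topics["Tags"]
).str.strip()

print(topics["text"].tolist())
```

Filling missing values first matters because `str()` of a missing value is the literal string `nan`, which would otherwise survive cleaning as a token.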


##  1. Text preprocessing

This function takes a raw text string and:
1. replaces punctuation with spaces
2. converts the text to lower case
3. tokenizes it
4. removes stop words
5. stems the remaining words
6. rejoins the stems into a single string

In [18]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Build the stemmer and the stop-word set once instead of on every call
stemmer = SnowballStemmer('english')
stops = set(stopwords.words("english"))

def TopicCleaning(Title):
    # Replace punctuation with spaces (letters, digits and underscores are kept)
    no_punct = re.sub(r"[^\w\s]", " ", str(Title))
    # Lower-case, then split into tokens
    words = word_tokenize(no_punct.lower())
    # Drop English stop words
    meaningful_words = [w for w in words if w not in stops]
    # Stem each remaining word and rejoin the stems into one string
    return ' '.join(stemmer.stem(w) for w in meaningful_words)
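
The character class `[^\w\s]` used in `TopicCleaning` matches anything that is neither a word character nor whitespace. It therefore strips punctuation but leaves digits and underscores in place, which is why tokens such as `5000` and `train_ml4_her` survive in the cleaned titles. Below is a small standard-library check of that behavior, with a stricter letters-only pattern for comparison (the second pattern is an alternative, not what the notebook uses, and the sample title is made up):

```python
import re

title = "TMDB 5000 Movie Dataset: train_ml4_her!"

# Pattern used in TopicCleaning: removes punctuation only
kept = re.sub(r"[^\w\s]", " ", title)

# Stricter alternative: keeps ASCII letters and whitespace only
letters_only = re.sub(r"[^a-zA-Z\s]", " ", title)

print(kept.split())
print(letters_only.split())
```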

### 1.1 Test function

In [19]:
clean_Title = TopicCleaning(dataset["Title"][3])
clean_Subtitle = TopicCleaning(dataset["Subtitle"][3])
clean_Tags = TopicCleaning(dataset["Tags"][3])

print(clean_Title)
print(clean_Subtitle)
print(clean_Tags)

human resourc analyt
best experienc employe leav prematur
employ


### 1.2 Add to new file

In [29]:
dataset = pd.read_csv("dataset.csv") 

clean_Title = TopicCleaning(dataset["Title"])

dataset['clean_Title'] = clean_Title

In [37]:
# Write the DataFrame to a new file so the original dataset.csv is not overwritten
dataset.to_csv("dataset_new.csv", index=False)

In [30]:
clean_Title

'0 credit card fraud detect 1 european soccer databas 2 tmdb 5000 movi dataset 3 human resourc analyt 4 global terror databas 5 bitcoin histor data 6 kaggl ml data scienc survey 2017 7 iri speci 8 world develop indic 9 daili news stock market predict 10 pokemon stat 11 lend club loan data 12 wine review 13 climat chang earth surfac temperatur data 14 open food fact 15 pay attend colleg 16 mushroom classif 17 breast cancer wisconsin diagnost data set 18 amazon fine food review 19 world happi report 20 fashion mnist 21 2016 us elect 22 game throne 23 speed date experi 24 footbal event 25 hous sale king counti usa 26 zillow econom data 27 movi dataset 28 new york stock exchang 29 ted talk 5071 train_ml4_her 5072 breast cancer diagnosi 5073 cyber crime 5074 googl search interest hurrican irma day 5075 frightgeist 2017 rank costum 5076 noshow 5077 h1 b analysi 5078 predict network attack 5079 onlinenewspopular 5080 titan machin learn disast 5081 russian financi indic 5082 network attack 508

In [31]:
dataset.head(5)

Unnamed: 0,Title,Subtitle,Owner,Votes,Last Update,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description,clean_Title
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1233.0,05/11/2016,crime\nfinance,CSV,144 MB,ODbL,"440,221 views","52,793 downloads","1,778 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...,0 credit card fraud detect 1 european soccer d...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1035.0,24/10/2016,association football\neurope,SQLite,299 MB,ODbL,"393,924 views","46,025 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...,0 credit card fraud detect 1 european soccer d...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1018.0,28/09/2017,film,CSV,44 MB,Other,"444,535 views","61,705 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...,0 credit card fraud detect 1 european soccer d...
3,Human Resources Analytics,Why are our best and most experienced employee...,Ludovic Benistant,832.0,30/11/2016,employment,CSV,553 KB,CC4,"309,644 views","47,350 downloads","1,772 kernels",32 topics,https://www.kaggle.com/ludobenistant/hr-analytics,This dataset is simulated\nWhy are our best an...,0 credit card fraud detect 1 european soccer d...
4,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,785.0,19/07/2017,crime\nterrorism\ninternational relations,CSV,144 MB,Other,"186,621 views","26,091 downloads",609 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr...",0 credit card fraud detect 1 european soccer d...
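
Every row of `clean_Title` above holds the same long string because `TopicCleaning` was called once on the whole `Title` column: `str()` of a pandas Series is its printed representation, index labels included, hence the `0 ... 1 ... 2 ...` prefixes. Below is a minimal sketch of a per-row alternative using `Series.apply`. Here `simple_clean` is a hypothetical stand-in for `TopicCleaning`, so the example runs without NLTK data.

```python
import re
import pandas as pd

def simple_clean(text):
    # Hypothetical stand-in for TopicCleaning: strip punctuation, lower-case
    return " ".join(re.sub(r"[^\w\s]", " ", str(text)).lower().split())

titles = pd.Series(["Credit Card Fraud Detection", "European Soccer Database"])

# Calling the function on the Series cleans its printed form, index labels included
whole_series = simple_clean(titles)

# apply calls the function once per element and returns a Series of the same length
per_row = titles.apply(simple_clean)

print(whole_series)
print(per_row.tolist())
```

In the notebook, the equivalent call would be `dataset['clean_Title'] = dataset['Title'].apply(TopicCleaning)`.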


In [17]:
dataset = pd.read_csv("dataset.csv")

# Clean each title separately (apply calls TopicCleaning once per row)
clean_Title = dataset["Title"].apply(TopicCleaning)

# Save a copy of the data to a new file, then read it back
dataset.to_csv("dataset_new.csv", index=False)
dataset_new = pd.read_csv("dataset_new.csv")
dataset_new.head(5)

Unnamed: 0,Title,Subtitle,Owner,Votes,Last Update,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1233.0,05/11/2016,crime\r\nfinance,CSV,144 MB,ODbL,"440,221 views","52,793 downloads","1,778 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1035.0,24/10/2016,association football\r\neurope,SQLite,299 MB,ODbL,"393,924 views","46,025 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1018.0,28/09/2017,film,CSV,44 MB,Other,"444,535 views","61,705 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\r\nWhat can we say about the succes...
3,Human Resources Analytics,Why are our best and most experienced employee...,Ludovic Benistant,832.0,30/11/2016,employment,CSV,553 KB,CC4,"309,644 views","47,350 downloads","1,772 kernels",32 topics,https://www.kaggle.com/ludobenistant/hr-analytics,This dataset is simulated\r\nWhy are our best ...
4,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,785.0,19/07/2017,crime\r\nterrorism\r\ninternational relations,CSV,144 MB,Other,"186,621 views","26,091 downloads",609 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\r\nInformation on more than 170,000 Te..."


In [12]:
dataset.head(5)

Unnamed: 0,Title,Subtitle,Owner,Votes,Last Update,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
0,Credit Card Fraud Detection,Anonymized credit card transactions labeled as...,Machine Learning Group - ULB,1233.0,05/11/2016,crime\nfinance,CSV,144 MB,ODbL,"440,221 views","52,793 downloads","1,778 kernels",26 topics,https://www.kaggle.com/mlg-ulb/creditcardfraud,The datasets contains transactions made by cre...
1,European Soccer Database,"25k+ matches, players & teams attributes for E...",Hugo Mathien,1035.0,24/10/2016,association football\neurope,SQLite,299 MB,ODbL,"393,924 views","46,025 downloads","1,459 kernels",75 topics,https://www.kaggle.com/hugomathien/soccer,The ultimate Soccer database for data analysis...
2,TMDB 5000 Movie Dataset,"Metadata on ~5,000 movies from TMDb",The Movie Database (TMDb),1018.0,28/09/2017,film,CSV,44 MB,Other,"444,535 views","61,705 downloads","1,394 kernels",46 topics,https://www.kaggle.com/tmdb/tmdb-movie-metadata,Background\nWhat can we say about the success ...
3,Human Resources Analytics,Why are our best and most experienced employee...,Ludovic Benistant,832.0,30/11/2016,employment,CSV,553 KB,CC4,"309,644 views","47,350 downloads","1,772 kernels",32 topics,https://www.kaggle.com/ludobenistant/hr-analytics,This dataset is simulated\nWhy are our best an...
4,Global Terrorism Database,"More than 170,000 terrorist attacks worldwide,...",START Consortium,785.0,19/07/2017,crime\nterrorism\ninternational relations,CSV,144 MB,Other,"186,621 views","26,091 downloads",609 kernels,11 topics,https://www.kaggle.com/START-UMD/gtd,"Context\nInformation on more than 170,000 Terr..."


## 2. TF-IDF Transformation