## Student Name: Akhila Mora
## Student Email: akhila.mora@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to learn more about the themes and concepts of existing smart cities. You also want to know where your smart city ranks among the others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity), find facts about the data, and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

The output file for this notebook is shown below as a table for display purposes. Note: The city name can be written as Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | summary | keywords|
| -- | -- | -- | -- | -- | -- | -- |
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| test | test |

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicants' PDFs as a dataset.
The dataset is from the U.S. Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It lets you extract the pages of each PDF file and add their text to a data structure (dataframe, list, dictionary, etc.).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files; handling docx files is not necessary.

In [1]:
from pypdf import PdfReader
import pandas as pd
import os
import pypdf
print(pypdf.__version__)

3.8.1


Create a data structure to add the city name and raw text. You can choose to split the city name from the file name.

In [2]:
import re
dataframes = []
c_names = []
for f in os.listdir('./smartcity'):
    #data = []
    name = os.path.splitext(f)[0]
    fp = os.path.join('./smartcity',f)
    #print(fp)

    if name != ".ipynb_checkpoints" and os.path.splitext(f)[1] == ".pdf":
        t_nm = f.split('.pdf')[0]
        print(t_nm)
        req_nm = ' '.join(t_nm.split(" ")[1:]) +" "+ t_nm.split(" ")[0]
        print(req_nm)
        c_names.append(req_nm)
        with open(fp, 'rb') as file:
            pdf_reader = PdfReader(file)
            text = ''
            #print(len(pdf_reader.pages))
            for pg in pdf_reader.pages:
                # extract each page once and reuse the result
                pg_text = pg.extract_text()
                # skip pages containing the word 'Contents' (intended to drop tables of contents)
                if 'Contents' not in pg_text:
                    text = text + pg_text
                    #print(text)
                    #data.append({'text':text})
            dataframes.append(text)

            #df = pd.DataFrame(data)
            #dataframes.append(df)

data = {'raw_text':dataframes}
df = pd.DataFrame(data)
df['city'] = c_names
print(df)
#df_combined = pd.concat(dataframes, ignore_index = True)
#print(df_combined)
    

GA Brookhaven
Brookhaven GA
NY Buffalo
Buffalo NY
CA Riverside
Riverside CA
AZ Scottsdale AZ
Scottsdale AZ AZ
FL Jacksonville
Jacksonville FL
LA New Orleans
New Orleans LA
AL Montgomery
Montgomery AL
MI Port Huron and Marysville
Port Huron and Marysville MI
WA Seattle
Seattle WA
LA Shreveport
Shreveport LA
WA Spokane
Spokane WA
IN Indianapolis
Indianapolis IN
AL Birmingham
Birmingham AL
LA Baton Rouge
Baton Rouge LA
FL Miami
Miami FL
CA Oceanside
Oceanside CA
CA San Jose_0
San Jose_0 CA
NE Lincoln
Lincoln NE
MA Boston
Boston MA
CA Sacramento
Sacramento CA
VA Richmond
Richmond VA
GA Atlanta
Atlanta GA
NY Rochester
Rochester NY
TN Memphis
Memphis TN
NC Raleigh
Raleigh NC
NY Albany Troy Schenectady Saratoga Springs
Albany Troy Schenectady Saratoga Springs NY
OH Cleveland
Cleveland OH
NC Charlotte
Charlotte NC
NJ Jersey City
Jersey City NJ
CA Chula Vista
Chula Vista CA
CA Long Beach
Long Beach CA
MI Detroit
Detroit MI
IA Des Moines
Des Moines IA
MO St. Louis
St. Louis MO
NE Omaha
Omaha NE
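
The listing above shows two file names that the simple split handles poorly: `AZ Scottsdale AZ` becomes `Scottsdale AZ AZ`, and `CA San Jose_0` keeps its `_0` suffix. A minimal sketch of a more defensive parser follows. It assumes every file name starts with a two-letter state code followed by a space, and `parse_city_name` is a hypothetical helper name, not part of the notebook.

```python
import re


def parse_city_name(filename):
    """Turn 'ST City Name.pdf' into 'City Name ST' (hypothetical helper)."""
    stem = re.sub(r"\.pdf$", "", filename, flags=re.IGNORECASE)
    stem = re.sub(r"_\d+$", "", stem)  # drop numeric suffixes such as '_0'
    state, _, city = stem.strip().partition(" ")
    city = city.strip()
    # Drop a trailing repeat of the state code, as in 'AZ Scottsdale AZ'.
    if city.endswith(" " + state):
        city = city[: -len(state) - 1].rstrip()
    return f"{city} {state}"


for name in ["GA Brookhaven.pdf", "AZ Scottsdale AZ.pdf", "CA San Jose_0.pdf"]:
    print(parse_city_name(name))
```

Applied inside the loading loop, the helper would replace the `req_nm` expression; the rest of the cell could stay as it is.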


## Cleaning Up PDFs

One of the more frustrating aspects of PDFs is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code operates on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, and common words (smart, city, page, etc.). Keep in mind that n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data and remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
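
The two steps in the list above could be sketched as follows, assuming the cleaned text is lowercase and whitespace-separated like the output of `normalize_corpus`. The term list, the `min_tokens` threshold, and the helper names `remove_domain_terms` and `flag_sparse_documents` are illustrative assumptions rather than part of the assignment.

```python
# Assumed starting list; extend it with city and state names as needed.
DOMAIN_TERMS = {"smart", "city", "page"}


def remove_domain_terms(clean_text, extra_terms=DOMAIN_TERMS):
    """Drop domain-specific filler words from already-normalized text."""
    return " ".join(tok for tok in clean_text.split() if tok not in extra_terms)


def flag_sparse_documents(texts_by_city, min_tokens=200):
    """Return cities whose text has fewer than min_tokens tokens.

    A very short text often means the PDF was scanned or extracted badly;
    the threshold is an assumption and should be tuned on the real corpus.
    """
    return [city for city, text in texts_by_city.items()
            if len(text.split()) < min_tokens]


sample = {
    "Norman OK": "smart city transit data page transit sensor",
    "Scanned PDF": "",
}
print(remove_domain_terms(sample["Norman OK"]))
print(flag_sparse_documents(sample, min_tokens=3))
```

This sketch only removes single tokens. Multi-word names such as "new orleans" would need phrase-level matching, which is not handled here.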


In [3]:
import nltk
import spacy
import unicodedata
import re
from nltk.corpus import wordnet
import collections
import en_core_web_md
#from textblob import Word
from nltk.tokenize.toktok import ToktokTokenizer
from bs4 import BeautifulSoup

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
nlp = spacy.load("en_core_web_md")
# nlp_vec = spacy.load('en_vectors_web_lg', parse=True, tag=True, entity=True)


CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    if bool(soup.find()):
        [s.extract() for s in soup(['iframe', 'script'])]
        stripped_text = soup.get_text()
        # collapse runs of carriage returns and newlines into a single newline
        stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    else:
        stripped_text = text
    return stripped_text


#def correct_spellings_textblob(tokens):
#	return [Word(token).correct() for token in tokens]  


def simple_porter_stemming(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text


def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text


def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens


def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


def remove_accented_chars(text):
    # NFKD decomposes accented characters so the base letter survives the ASCII encode
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]|\[|\]' if not remove_digits else r'[^a-zA-Z\s]|\[|\]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_stemming=False, text_lemmatization=True, 
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        
        
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)

        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))

        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)

        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)

        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)

        # stem text
        if text_stemming and not text_lemmatization:
            doc = simple_porter_stemming(doc)

        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)

        # lowercase the text
        if text_lower_case:
            doc = doc.lower()

        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case, stopwords=stopwords)

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()
            
        normalized_corpus.append(doc)
         
    return normalized_corpus

In [4]:
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
corpus = df['raw_text'][0:68]

# Tokenize on whitespace and remove stopwords (build the set once for fast lookups)
english_stopwords = set(stopwords.words('english'))
tokens = [word.lower() for sent in corpus for word in sent.split() if word.lower() not in english_stopwords]

# Generate frequency distribution
freq_dist = nltk.FreqDist(tokens)

# Print most common words
print(freq_dist.most_common(10))

[nltk_data] Downloading package stopwords to /home/akhila/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('city', 9104), ('smart', 5588), ('data', 4888), ('transportation', 3964), ('transit', 2624), ('•', 2423), ('traffic', 2249), ('system', 2223), ('public', 2176), ('.', 2038)]


In [5]:
print(df['raw_text'][0:68])

0     “Buford Highway through DeKalb County is the m...
1      \n  \nU.S. Department of Transportation \nNot...
2     CITY OF RIVERSIDE\nCALIFORNIA\nApplication For...
3       \n  \n \n \n \nFederal Agency Name:   U.S. D...
4     Beyond Traffic: The Smart City Challenge \nCon...
                            ...                        
63    City of Norfolk, VA\n*\nResponse Proposal to U...
64     \n   \n    \n \nSmart  DC \nMaking  the Distr...
65    BEYOND TRAFFIC: THE SMART CITY CHALLENGE - VIS...
66      \n1.  Project Vision  .........................
67    1 of 27 \n Executive summary : Creating a Vibr...
Name: raw_text, Length: 68, dtype: object


In [6]:
req_dforf1 = normalize_corpus(df['raw_text'][0:68], html_stripping=False, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_stemming=False, text_lemmatization=True, 
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list)

In [7]:
print(len(req_dforf1))


68


#### Add the cleaned text to the structure you created.


In [8]:
print(type(req_dforf1))
df['clean_text'] = req_dforf1

<class 'list'>


In [10]:
!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer



In [11]:
vector = TfidfVectorizer(ngram_range=(1,3))

In [12]:
train_vector = vector.fit_transform(df['clean_text'])

In [13]:
train_vector

<68x714926 sparse matrix of type '<class 'numpy.float64'>'
	with 963865 stored elements in Compressed Sparse Row format>

In [14]:
from sklearn.cluster import KMeans

In [15]:
kmean = KMeans(n_clusters=9)

In [16]:
kmean.fit(train_vector)



In [17]:
kmean.labels_

array([4, 2, 2, 2, 5, 5, 5, 5, 5, 5, 2, 1, 7, 5, 2, 5, 5, 0, 5, 2, 5, 4,
       7, 2, 5, 2, 2, 2, 8, 7, 5, 5, 1, 2, 2, 1, 5, 3, 2, 2, 2, 2, 0, 2,
       5, 2, 5, 2, 2, 2, 0, 0, 2, 5, 5, 1, 6, 5, 8, 2, 2, 0, 2, 5, 2, 2,
       5, 6], dtype=int32)

In [18]:
from sklearn.metrics import silhouette_score


In [19]:
labels = kmean.predict(train_vector)


In [20]:
silhouette_avg9 = silhouette_score(train_vector, labels)
silhouette_avg9

-0.12378281736070233

In [21]:
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

In [22]:
ch_score9 = calinski_harabasz_score(train_vector.toarray(), labels)

# Compute the Davies-Bouldin score for the clustering result
db_score9 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 9:", ch_score9)
print("The Davies-Bouldin score is 9: ", db_score9)

The Calinski and Harabasz score is 9: 1.170837713307526
The Davies-Bouldin score is 9:  3.2808945615202623


In [23]:
kmean = KMeans(n_clusters=18)
kmean.fit(train_vector)
labels = kmean.predict(train_vector)
silhouette_avg18 = silhouette_score(train_vector, labels)
print("Silhouette score 18:",silhouette_avg18)
ch_score18 = calinski_harabasz_score(train_vector.toarray(), labels)

db_score18 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 18:", ch_score18)
print("The Davies-Bouldin score is 18: ", db_score18)




Silhouette score 18: 0.00048224800770234647
The Calinski and Harabasz score is 18: 1.0931813213798938
The Davies-Bouldin score is 18:  0.9273446098570951


In [24]:
kmean = KMeans(n_clusters=36)
kmean.fit(train_vector)
labels = kmean.predict(train_vector)
silhouette_avg36 = silhouette_score(train_vector, labels)
print("Silhouette score 36:",silhouette_avg36)
ch_score36 = calinski_harabasz_score(train_vector.toarray(), labels)

db_score36 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 36:", ch_score36)
print("The Davies-Bouldin score is 36: ", db_score36)



Silhouette score 36: -0.0008394504056384801
The Calinski and Harabasz score is 36: 1.1293725526318013
The Davies-Bouldin score is 36:  0.895044187756752


In [25]:
from scipy.cluster.hierarchy import linkage

from scipy.cluster.hierarchy import fcluster

Z = linkage(train_vector.toarray(), method='ward')
labels = fcluster(Z, t=9, criterion='maxclust')
                  
silhouette_avg9 = silhouette_score(train_vector, labels)
print("Silhouette score 9:",silhouette_avg9)
ch_score9 = calinski_harabasz_score(train_vector.toarray(), labels)

db_score9 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 9:", ch_score9)
print("The Davies-Bouldin score is 9: ", db_score9)                  

Silhouette score 9: 0.014942223766580564
The Calinski and Harabasz score is 9: 1.2683450500290385
The Davies-Bouldin score is 9:  2.753847577121421


In [26]:
Z = linkage(train_vector.toarray(), method='ward')
labels = fcluster(Z, t=18, criterion='maxclust')
                  
silhouette_avg18 = silhouette_score(train_vector, labels)
print("Silhouette score 18:",silhouette_avg18)
ch_score18 = calinski_harabasz_score(train_vector.toarray(), labels)

db_score18 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 18:", ch_score18)
print("The Davies-Bouldin score is 18: ", db_score18)

Silhouette score 18: 0.008730683304876937
The Calinski and Harabasz score is 18: 1.2239722006234135
The Davies-Bouldin score is 18:  2.1036548709699487


In [27]:
Z = linkage(train_vector.toarray(), method='ward')
labels = fcluster(Z, t=36, criterion='maxclust')
                  
silhouette_avg36 = silhouette_score(train_vector, labels)
print("Silhouette score 36:",silhouette_avg36)
ch_score36 = calinski_harabasz_score(train_vector.toarray(), labels)

db_score36 = davies_bouldin_score(train_vector.toarray(), labels)

print("The Calinski and Harabasz score is 36:", ch_score36)
print("The Davies-Bouldin score is 36: ", db_score36)

Silhouette score 36: -0.0021525454066276253
The Calinski and Harabasz score is 36: 1.2439593034433338
The Davies-Bouldin score is 36:  1.348871215464741


In [28]:
from sklearn.cluster import DBSCAN

# Create a DBSCAN object
dbscan = DBSCAN(eps=1.5, min_samples=5)

# Fit the DBSCAN object on the vectorized data
dbscan.fit(train_vector)

# Predict the cluster labels for the vectorized data
labels = dbscan.labels_

# Count the number of clusters
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)


Estimated number of clusters: 1


In [29]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from scipy.cluster.hierarchy import linkage, fcluster

# Use the TF-IDF matrix built earlier as the feature matrix
X = train_vector

# Define range of k values to evaluate
k_values = range(2, 24)

# Initialize lists to store results
ch_scores_kmeans = []
silhouette_scores_kmeans = []
davies_bouldin_scores_kmeans = []

ch_scores_linkage = []
silhouette_scores_linkage = []
davies_bouldin_scores_linkage = []

# Iterate over k values
for k in k_values:
    # Fit k-means model
    kmeans_model = KMeans(n_clusters=k, random_state=42).fit(X)
    
    # Calculate metrics for k-means
    ch_score_kmeans = calinski_harabasz_score(X.toarray(), kmeans_model.labels_)
    silhouette_score_kmeans = silhouette_score(X, kmeans_model.labels_)
    davies_bouldin_score_kmeans = davies_bouldin_score(X.toarray(), kmeans_model.labels_)
    
    # Append results to lists
    ch_scores_kmeans.append(ch_score_kmeans)
    silhouette_scores_kmeans.append(silhouette_score_kmeans)
    davies_bouldin_scores_kmeans.append(davies_bouldin_score_kmeans)
    
    # Perform hierarchical clustering with linkage
    linkage_matrix = linkage(X.toarray(), method='ward')
    linkage_labels = fcluster(linkage_matrix, k, criterion='maxclust')
    
    # Calculate metrics for linkage
    ch_score_linkage = calinski_harabasz_score(X.toarray(), linkage_labels)
    silhouette_score_linkage = silhouette_score(X, linkage_labels)
    davies_bouldin_score_linkage = davies_bouldin_score(X.toarray(), linkage_labels)
    
    # Append results to lists
    ch_scores_linkage.append(ch_score_linkage)
    silhouette_scores_linkage.append(silhouette_score_linkage)
    davies_bouldin_scores_linkage.append(davies_bouldin_score_linkage)
    
# Create DataFrame with results
results_df = pd.DataFrame({
    'K': k_values,
    'CH Score (K-Means)': ch_scores_kmeans,
    'Silhouette Score (K-Means)': silhouette_scores_kmeans,
    'Davies-Bouldin Score (K-Means)': davies_bouldin_scores_kmeans,
    'CH Score (Linkage)': ch_scores_linkage,
    'Silhouette Score (Linkage)': silhouette_scores_linkage,
    'Davies-Bouldin Score (Linkage)': davies_bouldin_scores_linkage
})




In [30]:
results_df

| | K | CH Score (K-Means) | Silhouette Score (K-Means) | Davies-Bouldin Score (K-Means) | CH Score (Linkage) | Silhouette Score (Linkage) | Davies-Bouldin Score (Linkage) |
|--|--|--|--|--|--|--|--|
| 0 | 2 | 1.203952 | 0.046406 | 2.220843 | 1.337333 | 0.036296 | 3.559832 |
| 1 | 3 | 1.278319 | -0.004435 | 4.913963 | 1.324281 | 0.031307 | 2.464872 |
| 2 | 4 | 1.19989 | 0.010224 | 4.661496 | 1.320808 | 0.031804 | 1.720088 |
| 3 | 5 | 1.153897 | -0.102977 | 4.439481 | 1.312459 | 0.012864 | 3.438031 |
| 4 | 6 | 1.166683 | -0.054472 | 2.715054 | 1.301557 | 0.012811 | 3.306187 |
| 5 | 7 | 1.150705 | -0.101835 | 3.686196 | 1.291308 | 0.014895 | 3.05587 |
| 6 | 8 | 1.156645 | -0.002306 | 2.934716 | 1.278249 | 0.014904 | 2.996231 |
| 7 | 9 | 1.185117 | 0.013614 | 2.404045 | 1.268345 | 0.014942 | 2.753848 |
| 8 | 10 | 1.140845 | -0.095776 | 1.933558 | 1.260104 | 0.014885 | 2.561505 |
| 9 | 11 | 1.110942 | -0.115641 | 1.742494 | 1.253642 | 0.014691 | 2.507789 |


### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

Removed `FL Tallahassee.pdf` because its text is embedded in images, which pypdf cannot extract.
Removed `NM Albuquerque.docx` and `GA Columbus.docx` because they are in docx format rather than PDF.

#### Explain what additional text processing methods you used and why.

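TF-IDF vectorization over unigrams, bigrams, and trigrams (`TfidfVectorizer(ngram_range=(1,3))`), which converts the cleaned text into the numeric features that the clustering models require.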

#### Did you identify any potentially problematic words?

Nope

## Experimenting with Clustering Models

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |
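
Each `[S,CH,DB]` cell in the table above could be produced by a small helper like the one sketched here. It assumes scikit-learn and NumPy are installed; `score_labels` is a hypothetical name, and the two-blob data is synthetic and only there to make the sketch runnable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def score_labels(features, labels):
    """Return [silhouette, Calinski-Harabasz, Davies-Bouldin] for one labeling."""
    # Densify sparse input, as the notebook does with .toarray().
    dense = features.toarray() if hasattr(features, "toarray") else np.asarray(features)
    return [
        float(silhouette_score(dense, labels)),
        float(calinski_harabasz_score(dense, labels)),
        float(davies_bouldin_score(dense, labels)),
    ]


# Synthetic data: two tight, well-separated blobs (illustration only).
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
                 rng.normal(5.0, 0.1, size=(10, 2))])
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)
scores = score_labels(toy, toy_labels)
print(scores)
```

Higher silhouette and Calinski-Harabasz values indicate better-separated clusters, while a lower Davies-Bouldin value is better. In the notebook, the same helper could be called with `train_vector` and the labels from each model.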



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.
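
One possible way to turn the three metrics into a single choice of k is a rank sum: rank every k on each metric, add the ranks, and take the smallest total. The sketch below applies that rule to the hierarchical (Linkage) columns of the results table in this notebook, which only shows k from 2 to 11, so the outcome is illustrative rather than a final answer. The rule itself and the helper names are assumptions, not a method prescribed by the assignment.

```python
# (k, Calinski-Harabasz, silhouette, Davies-Bouldin) for hierarchical clustering,
# copied from the Linkage columns of the results table in this notebook.
linkage_scores = [
    (2, 1.337333, 0.036296, 3.559832),
    (3, 1.324281, 0.031307, 2.464872),
    (4, 1.320808, 0.031804, 1.720088),
    (5, 1.312459, 0.012864, 3.438031),
    (6, 1.301557, 0.012811, 3.306187),
    (7, 1.291308, 0.014895, 3.05587),
    (8, 1.278249, 0.014904, 2.996231),
    (9, 1.268345, 0.014942, 2.753848),
    (10, 1.260104, 0.014885, 2.561505),
    (11, 1.253642, 0.014691, 2.507789),
]


def rank_positions(values, higher_is_better):
    """Map each row index to its rank on one metric (0 = best)."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    return {idx: rank for rank, idx in enumerate(order)}


def pick_k_by_rank_sum(rows):
    """Return the k with the lowest summed rank, plus all rank totals."""
    ks = [row[0] for row in rows]
    ch_rank = rank_positions([row[1] for row in rows], higher_is_better=True)
    sil_rank = rank_positions([row[2] for row in rows], higher_is_better=True)
    db_rank = rank_positions([row[3] for row in rows], higher_is_better=False)
    totals = {ks[i]: ch_rank[i] + sil_rank[i] + db_rank[i]
              for i in range(len(rows))}
    return min(totals, key=totals.get), totals


best_k, totals = pick_k_by_rank_sum(linkage_scores)
print(best_k, totals)
```

A rank sum weights the three metrics equally and ignores how close the scores are, so it is worth checking the raw values as well before settling on a k.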


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|--|
|Hierarchical |--|--|--|--|
|DBSCAN | X | X | X | -- |



#### How did you approach finding the optimal k?

[Your answer here]

#### What algorithm do you believe is the best? Why?

[Your Answer]

### Add Cluster ID to output file
In your data structure, add the cluster ID for each smart city. Show the code that appends the `clusterid` column below.

In [31]:
kmean = KMeans(n_clusters=13)
kmean.fit(train_vector)
cluster_labels = kmean.labels_

# Add cluster labels to dataframe
df['clusterid'] = cluster_labels



In [32]:
df['clusterid']

0     4
1     5
2     9
3     5
4     5
     ..
63    1
64    5
65    5
66    5
67    5
Name: clusterid, Length: 68, dtype: int32

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.

In [33]:
import pickle


with open('model.pkl', 'wb') as f:
    pickle.dump(kmean, f)
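
A minimal round-trip sketch of saving and reloading with `pickle` is shown here. A plain dictionary stands in for the fitted `kmean` so the example runs on its own; `save_model` and `load_model` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize any picklable object to path."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously pickled object; only use files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object so the sketch runs without a fitted estimator;
# in the notebook this would be the fitted `kmean`.
stand_in = {"n_clusters": 13, "labels": [4, 5, 9]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in, model_path)
    restored = load_model(model_path)

print(restored == stand_in)
```

To predict clusters for new documents in `project3.py`, the loaded model would likely also need the same fitted `TfidfVectorizer`, since new text has to be mapped into the feature space the model was trained on. Saving the vectorizer alongside the model is one option.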


## Deriving Themes and Concepts

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [34]:
!pip install -U scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

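# NOTE: df['clean_text'] holds plain strings, so the identity tokenizer below
# yields single characters rather than words; a whitespace split
# (lambda doc: doc.split()) would give word tokens instead.
cv = CountVectorizer(min_df=20, max_df=0.6, ngram_range=(1,2),
                     token_pattern=None, tokenizer=lambda doc: doc,
                     preprocessor=lambda doc: doc)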
cv_features = cv.fit_transform(df['clean_text'])
cv_features.shape



(68, 67)

In [35]:
vocabulary = np.array(cv.get_feature_names_out())
print('Total Vocabulary Size:', len(vocabulary))


Total Vocabulary Size: 67


In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Convert text data into matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_text'])

# Perform topic modeling using Latent Dirichlet Allocation (LDA)
lda_model = LatentDirichletAllocation(n_components=5, random_state=0)
lda_model.fit(X)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Print the top 5 words for each topic
for topic_idx, topic in enumerate(lda_model.components_):
    top_words_idx = topic.argsort()[:-6:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")


Topic 0: city transportation smart datum system
Topic 1: city smart transportation system vehicle
Topic 2: city smart datum vehicle use
Topic 3: city smart transportation system datum
Topic 4: city smart transportation vehicle system


### Extract themes
Write a theme for each topic (at least a sentence each).

[Your Answer]

[Your Answer]

[Your Answer]

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Create a dictionary to store top topics for each city
city_topics = {}


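# NOTE: `city` must already be defined (for example by a loop over df['city']);
# this cell stores the model-wide topic words under whichever city it holds.
city_topics[city] = []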
for topic_idx, topic in enumerate(lda_model.components_):
    top_words_idx = topic.argsort()[:-6:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    city_topics[city].append({'topic': topic_idx, 'words': top_words})


In [49]:
city_topics

{'Lubbock TX': [{'topic': 0,
   'words': ['city', 'new', 'smart', 'datum', 'use']},
  {'topic': 1, 'words': ['city', 'new', 'smart', 'datum', 'use']}]}
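
The cell above stores the model-wide topic words rather than the two strongest topics of each document. A minimal sketch of picking the top two topics per document follows. It assumes the rows come from something like `lda_model.transform(X)`, which returns one row of topic weights per document; the weights below are made up, and `top_two_topic_ids` is a hypothetical helper.

```python
def top_two_topic_ids(doc_topic_rows, prefix="T"):
    """Return a 'T<i>, T<j>' label per document for its two heaviest topics."""
    labels = []
    for row in doc_topic_rows:
        # Sort topic indices by weight, largest first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        labels.append(", ".join(f"{prefix}{i}" for i in ranked[:2]))
    return labels


# Made-up weights for two documents over five topics (illustration only).
toy_doc_topics = [
    [0.05, 0.60, 0.10, 0.20, 0.05],
    [0.40, 0.05, 0.45, 0.05, 0.05],
]
print(top_two_topic_ids(toy_doc_topics))
```

The result could be assigned to a `topicids` column, matching the `T1, T2` style of the example output table. Topic indices here start at 0, which is an assumption about the desired labeling.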

## Gathering Applicant Summaries and Keywords

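For each smart city applicant, gather a summary and keywords that are important to that document. You can use gensim to do this; note that the `gensim.summarization` module was removed in gensim 4.0, so the functions below require an earlier release (3.x). Here are examples of functions that you could use.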

```python

from gensim.summarization import summarize

def summary(text, ratio=0.2, word_count=250, split=False):
    return summarize(text, ratio= ratio, word_count=word_count, split=split)
    
from gensim.summarization import keywords

def keys(text, ratio=0.01):
    return keywords(text, ratio=ratio)
```
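
If an older gensim release is not available, a much simpler stand-in can fill the `summary` and `keywords` columns. The sketch below is not TextRank, which is what gensim's functions implement: it takes the most frequent non-stopword tokens as keywords and the leading sentences as a summary. The stopword list and the function names are illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would reuse the NLTK list.
BASIC_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is",
                   "will", "with"}


def frequency_keywords(text, top_n=5, stopwords=BASIC_STOPWORDS):
    """Return the top_n most frequent tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tok for tok in tokens
                     if tok not in stopwords and len(tok) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def lead_summary(text, max_sentences=2):
    """Return the first max_sentences sentences as a naive summary."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


sample_text = ("Transit sensors improve transit safety. "
               "Sensors feed transit data to the city. "
               "Funding is pending.")
print(frequency_keywords(sample_text, top_n=2))
print(lead_summary(sample_text))
```

Frequency counts favor words that appear in nearly every application, so removing domain terms such as "city" and "smart" beforehand matters even more here than with a graph-based method.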

### Add Summaries and Keywords
Add summary and keywords to output file.

## Write output data

The output data should be written as a TSV file.
You can use the `to_csv` method from pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep='\t')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`

In [58]:
df.to_csv('smartcity_eda.tsv', sep='\t', escapechar='\\')
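
Raw PDF text contains tabs and newlines, so it is worth checking that the written file reads back cleanly. The sketch below uses only the standard library `csv` module to write and re-read one row in the shape of the example output table; the column names are taken from that table and are an assumption about the final layout, and `read_tsv_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Column names from the example output table (assumed final layout).
EXPECTED_COLUMNS = ["city", "raw text", "clean text", "clusterid",
                    "topicids", "summary", "keywords"]


def read_tsv_rows(path):
    """Read a tab-separated file into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


example_row = {
    "city": "Norman, OK",
    "raw text": "Test,\ttest\nand testing.",  # embedded tab and newline
    "clean text": "test test test",
    "clusterid": 0,
    "topicids": "T1, T2",
    "summary": "test",
    "keywords": "test",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "smartcity_eda.tsv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=EXPECTED_COLUMNS,
                                delimiter="\t")
        writer.writeheader()
        writer.writerow(example_row)
    rows = read_tsv_rows(path)

print(rows[0]["city"], "|", rows[0]["topicids"])
```

The same reader can be pointed at `smartcity_eda.tsv` to confirm that every row has the expected columns. Values come back as strings, so `clusterid` would need converting to an integer if it is used numerically.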


# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
