## Student Name: 
## Student Email:

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to know more about themes and concepts about existing smart cities. You also want to know where does your smart city place among others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine a data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity) to find facts about the data and communicating those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | 
| -- | -- | -- | -- | -- | 
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| 

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicant's PDFs as a dataset.
The dataset is from the U.S Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applicantions?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files (Required)

Load data from `smartcity/`. 

To extract the data from the pdf files, use the [pypdf.pdf.PdfFileReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages and pdf files and add them to a data structure (dataframe, list, dictionary, etc).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [2]:
import os

from project3 import *


path = os.path.abspath('./smartcity')
remove_cities = ['CA Moreno Valley', 'FL Tallahassee', 'NV Reno', 'OH Toledo', 'TX Lubbock']
filenames = [filename for filename in os.listdir(path) if filename.endswith('.pdf') and filename[:len(filename) - 4] not in remove_cities]


print(filenames)


['AK Anchorage.pdf', 'AL Birmingham.pdf', 'AL Montgomery.pdf', 'AZ Scottsdale AZ.pdf', 'AZ Tucson.pdf', 'CA Chula Vista.pdf', 'CA Fremont.pdf', 'CA Fresno.pdf', 'CA Long Beach.pdf', 'CA Oakland.pdf', 'CA Oceanside.pdf', 'CA Riverside.pdf', 'CA Sacramento.pdf', 'CA San Jose_0.pdf', 'CT NewHaven.pdf', 'DC_0.pdf', 'FL Jacksonville.pdf', 'FL Miami.pdf', 'FL Orlando.pdf', 'FL St. Petersburg.pdf', 'FL Tampa.pdf', 'GA Atlanta.pdf', 'GA Brookhaven.pdf', 'IA Des Moines.pdf', 'IN Indianapolis.pdf', 'KY Louisville.pdf', 'LA Baton Rouge.pdf', 'LA New Orleans.pdf', 'LA Shreveport.pdf', 'MA Boston.pdf', 'MD Baltimore.pdf', 'MI Detroit.pdf', 'MI Port Huron and Marysville.pdf', 'MN Minneapolis St Paul.pdf', 'MO St. Louis.pdf', 'NC Charlotte.pdf', 'NC Greensboro.pdf', 'NC Raleigh.pdf', 'NE Lincoln.pdf', 'NE Omaha.pdf', 'NJ Jersey City.pdf', 'NJ Newark.pdf', 'NV Las Vegas.pdf', 'NY Albany Troy Schenectady Saratoga Springs.pdf', 'NY Buffalo.pdf', 'NY Mt Vernon Yonkers New Rochelle.pdf', 'NY Rochester.pdf

In [3]:
raw_datas = {}
for filename in filenames:
    file_path = os.path.join(path, filename)
    city = filename[:len(filename) - 4]
    _, raw_data = get_raw_data(file_path)
    raw_datas[city] = raw_data

Create a data structure to add the city name and raw text. You can choose to split the city name from the file.

In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['city'] = np.array([filename[:len(filename) - 4] for filename in filenames])
df['raw text'] = df['city'].apply(lambda x : raw_datas[x])


## Cleaning Up PDFs (Required)

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may effect clustering and topic modeling. Words to consider are cities, states, common words (smart, city, page, etc.). Keep in mind n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data to remove applicants that text was not processed correctly. Do not remove more than 15 cities from the data.


#### Add the cleaned text to the structure you created.


In [5]:
df['clean text'] = df['raw text'].apply(lambda x : data_clean(x))

### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

[Your Answer Here]['CA Moreno Valley', 'FL Tallahassee', 'NV Reno', 'OH Toledo', 'TX Lubbock'], when we process the pdf data, the processed result is empty.

#### Explain what additional text processing methods you used and why.

[Your Answer Here]replace word in common_words = ['Smart City', 'City', 'city', 'page', 'Page', 'challenge', 'challenges', "The City's", 'The City'] with M, as they are all noun and they are common in all pdfs.(I want to use this previously, but the previous method is too time consuming, so I discard it, but I preserve this as a point)

#### Did you identify any potientally problematic words?

[Your Answer Here]Maybe some words like the, 's will be problematic.

## Experimenting with Clustering Models (Required)

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|0.018356910057666313, 1.2987109728466466, 1.8939094988332665|-0.0023270566421437957, 1.3033397792478436, 1.2575264983050891|0.008048219490464064, 1.2739460443483683, 1.0752777864394105|45|
|Hierarchical |0.005663914856966519, 1.445434318707135, 2.8698546765571673|0.012432892332997752, 1.34875079442378, 1.8753579789793338| 0.0163241030167149, 1.3601269789601307, 1.1751112095704626|50|
|DBSCAN | X | X | X | -- |



In [47]:
import pickle
random_seed = 10086
model_name = 'model.pkl'
score = 0
score_dict = {}

In [7]:
def metric(S, CH, DB):
    return S * CH / DB

In [8]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

In [9]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = tfidf_vectorizer.fit_transform(df['raw text'])

In [10]:
print(type(X))
print(X.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(64, 1000)


In [37]:
X = X.toarray()
model = ''

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

In [48]:
k_means_scores = []
for k in range(2, 51):
    kmeans = KMeans(n_clusters=k, random_state=random_seed)
    kmeans_labels = kmeans.fit_predict(X)
    kmeans_silhouette = silhouette_score(X, kmeans_labels)
    kmeans_calinski_harabasz = calinski_harabasz_score(X, kmeans_labels)
    kmeans_davies_bouldin = davies_bouldin_score(X, kmeans_labels)
    k_means_scores.append((k, kmeans_silhouette, kmeans_calinski_harabasz, kmeans_davies_bouldin))
    
    if metric(kmeans_silhouette, kmeans_calinski_harabasz, kmeans_davies_bouldin) > score:
        score = metric(kmeans_silhouette, kmeans_calinski_harabasz, kmeans_davies_bouldin)
        
        print(f'k = {k}, kmeans')
        model = kmeans.fit(X)
print(k_means_scores)



k = 2, kmeans






k = 45, kmeans




[(2, 0.026201491122642716, 1.7771036552226012, 1.455078777195983), (3, 0.014942185219364629, 1.5290745498451392, 4.978054831108032), (4, 0.021454649442907463, 1.5094573860361247, 4.131231656887265), (5, -0.023966016541466982, 1.2915550083782397, 2.4404393884800926), (6, 0.008583656972587788, 1.314549388652139, 2.078590402731845), (7, 0.015038645153087298, 1.363497686081285, 2.979934413769687), (8, -0.017264585713397264, 1.3097791851220084, 2.7785304800651778), (9, 0.018356910057666313, 1.2987109728466466, 1.8939094988332665), (10, 0.0018499463516662342, 1.252480692040077, 1.4053614168134638), (11, -0.014644164332172298, 1.2658937438239382, 1.58752391395445), (12, -0.018691996317524155, 1.2896184629222243, 2.3697002509177056), (13, -0.009245359846450326, 1.2870986440763648, 1.326477320298282), (14, 0.014588583195611373, 1.2389564864032778, 1.4710276374450808), (15, 0.014490174510875291, 1.2400696501558455, 1.154318990118372), (16, 0.00686902704978563, 1.2481487712448018, 1.2132704128539

In [19]:
dbscan = DBSCAN(eps=1.0, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

In [49]:
hierarchical_scores = []
for k in range(2, 51):
    hierarchical = AgglomerativeClustering(n_clusters=k)
    hierarchical_labels = hierarchical.fit_predict(X)
    hierarchical_silhouette = silhouette_score(X, hierarchical_labels)
    hierarchical_calinski_harabasz = calinski_harabasz_score(X, hierarchical_labels)
    hierarchical_davies_bouldin = davies_bouldin_score(X, hierarchical_labels)
    hierarchical_scores.append((k, hierarchical_silhouette, hierarchical_calinski_harabasz, hierarchical_davies_bouldin))
    
    if metric(hierarchical_silhouette, hierarchical_calinski_harabasz, hierarchical_davies_bouldin) > score:
        score = metric(hierarchical_silhouette, hierarchical_calinski_harabasz, hierarchical_davies_bouldin)
        print(f'k = {k}, hierarchical')
        model = hierarchical.fit(X)
        
print(hierarchical_scores)

k = 49, hierarchical
k = 50, hierarchical
[(2, 0.026201491122642716, 1.7771036552226012, 1.455078777195983), (3, 0.02655114892687444, 1.7298727277279753, 4.236972073166805), (4, 0.029815274184603963, 1.6589896305702534, 3.631355625995335), (5, 0.03180593517712997, 1.5827448279884282, 3.2626516743523966), (6, 0.032556243547474384, 1.5346609448262953, 2.8638683960226476), (7, 0.021723274007616693, 1.502275888855633, 2.7216280741680388), (8, 0.0055261584395401644, 1.4729787108120982, 3.025790996539374), (9, 0.005663914856966519, 1.445434318707135, 2.8698546765571673), (10, 0.0062969388531124125, 1.4243819726689315, 2.6550799564160616), (11, 0.007636812049384014, 1.4078186465106874, 2.3795538898994617), (12, 0.008716399478678376, 1.394604012407734, 2.4610768948395196), (13, 0.009429522169521982, 1.3833687092038005, 2.321755131265557), (14, 0.009596233085723471, 1.3736798429827193, 2.2952816682247454), (15, 0.010303226075237242, 1.3651966624852006, 2.1786743659078573), (16, 0.01035485548508

#### How did you approach finding the optimal k?

[Your answer here]50

#### What algorithm do you believe is the best? Why?

[Your Answer]hierarchical, it is the result we run

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city respectively. Show the to append the clusterid code below.

In [50]:
print(type(model))
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
df['cluster_id'] = df['clean text'].apply(lambda x : model.predict(tfidf_vectorizer.fit_transform([x]).toarray())[0])
print(df['cluster_id'])

<class 'sklearn.cluster._agglomerative.AgglomerativeClustering'>


AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistance](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to save the persistance using one of the methods listed in the link.

In [None]:
pickle.dump(model, open(model_name, 'wb'))

## Derving Themes and Concepts (Required)

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [74]:
import nltk

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in papers:
        paper = paper.lower()
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        paper_tokens = [token for token in paper_tokens if len(token) > 1]
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        paper_tokens = list(filter(None, paper_tokens))
        if paper_tokens:
            norm_papers.append(paper_tokens)
            
    return norm_papers

In [70]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1,2),
                     token_pattern=None, tokenizer=lambda doc: doc,
                     preprocessor=lambda doc: doc)

data = pd.DataFrame()
data['text'] = df['raw text'].apply(lambda x : x.split('\n'))

In [93]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def get_topic(text):
    strs = text.split('\n')
    norm_strs = normalize_corpus(strs)
    cv = CountVectorizer(ngram_range=(1,2),
                     token_pattern=None, tokenizer=lambda doc: doc,
                     preprocessor=lambda doc: doc)
    cv_features = cv.fit_transform(norm_strs)
    
    vocabulary = np.array(cv.get_feature_names())
    lsi_model = TruncatedSVD(n_components=50, n_iter=50, random_state=42)
    document_topics = lsi_model.fit_transform(cv_features)
    topic_terms = lsi_model.components_
    
    top_terms = 2
    topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
    topic_keyterm_weights = np.array([topic_terms[row, columns] 
                                 for row, columns in list(zip(np.arange(50), topic_key_term_idxs))])
    topic_keyterms = vocabulary[topic_key_term_idxs]
    topic_keyterms_weights = list(zip(topic_keyterms, topic_keyterm_weights))
    for n in range(2):
        d1 = []
        d2 = []
        terms, weights = topic_keyterms_weights[n]
        term_weights = sorted([(t, w) for t, w in zip(terms, weights)], 
                              key=lambda row: -abs(row[1]))
        for term, wt in term_weights:
            if wt >= 0:
                d1.append((term, round(wt, 3)))
            else:
                d2.append((term, round(wt, 3)))

    return [d1, d2]

    

### Extract themes
Write a theme for each topic (atleast a sentence each).

[Your Answer]

[Your Answer]

[Your Answer]

### Add Topid ID to output file
Add the top two topics for each smart city to the data structure.

In [94]:
df['topicids'] = df['clean text'].apply(lambda x : get_topic(x))

AttributeError: 'CountVectorizer' object has no attribute 'feature_name'

## Gathering Applicant Summaries and Keywords (Extra Credit Section)

For each smart city applicant, gather a summary and keywords that are important to that document. Gensim is outdated; try a spacy or nltk method.



In [None]:
#the funciton is in project3.py

### Add Summaries and Keywords
Add summary and keywords to output file.

In [None]:
df['keywords'] = df['raw text'].apply(lambda x : summarize(x)[0])
df['summary'] = df['raw text'].apply(lambda x : summarize(x)[1])

## Write output data (Required)

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep = '')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`

In [None]:
df.to_csv('smartcity_eda.tsv', sep='\t')

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
