# Final Project

## Framing

**Introduction**: describe your dataset, and why you're interested in it

The dataset for this study is a list of article titles from a self-media company in China. More than ten thousand titles were retrieved online, together with the draft category assigned to each title.
I am interested in it because this company is the largest self-media source in China, and it mainly publishes short articles on pregnancy and early child-rearing. By clustering these titles, I want to explore the most discussed topics in this area and validate the draft categories designed by the company.

**Research question(s)**: describe the overall research question of your project

Which topics about pregnancy and early child-rearing are discussed most, as revealed by clustering the article titles?

Can the clusters of titles validate the topic categories designed by the self-media company?

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypothesis should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypothesis should be formulated as an affirmation (and not a question)
    * You can also describe alternative hypotheses, if you think that your results could go either way (but again, have a rationale for why)
    
    The clusters reflect the most discussed topics.
    The clusters can validate the categorization, but more data, such as the content of each article, is needed to validate it accurately.

**Results**:
    * how are you planning to test each hypothesis? What models are you thinking of using?
    * what are the best results you can hope for? Is that interesting / relevant for other researchers?
    * what are the implications of your potential findings for practitioners?
    
    K-means and DBSCAN will be the primary methods in this study.
    The best result would be that the clusters confirm the company's categorization. More interestingly, if the clusters reflect the frequency of each category, I can determine which category occurs most often.
    The frequency of each category can then be reported to the company so that they can write more articles on topics that interest parents but have not been covered frequently.
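
One possible way to make the validation step concrete is to compare cluster assignments with the draft categories using a contingency table and the adjusted Rand index. The sketch below assumes scikit-learn is available; the `categories` and `clusters` lists are hypothetical toy labels, not values from the dataset.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels, for illustration only
categories = ['孕期', '孕期', '育儿', '育儿', '备孕', '备孕']
clusters = [0, 0, 1, 1, 2, 2]

# Contingency table: rows are draft categories, columns are cluster ids
table = pd.crosstab(pd.Series(categories, name='category'),
                    pd.Series(clusters, name='cluster'))
print(table)

# 1.0 means the two partitions agree exactly;
# values near 0 mean agreement is no better than chance
ari = adjusted_rand_score(categories, clusters)
print(ari)
```

The row sums of such a table also give the frequency of each category, which is the quantity the report to the company would rely on.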

**Threats**
    * Describe issues that might arise during the analyses above
    * Come up with backup plans in case you run into these issues
    

## Data Exploration

Describe your raw data below; provide definition / explanations for the measures you're using

## Data Cleaning

Clean your data in this section, and make sure it's ready to be analyzed for next week!

In [1]:
# Import libraries
import jieba   #Chinese word segmentation tool
import pandas as pd
import os
import codecs
import csv
import re

In [2]:
os.chdir('/Users/peizhiwen/Desktop/S435 Final Project')
os.getcwd()

'/Users/peizhiwen/Desktop/S435 Final Project'

In [3]:
# Import data
# The data was originally in Excel and was exported to a UTF-16 text file so that it can be imported into Python
with codecs.open('Content.txt', 'rb', encoding = 'utf-16') as f:
    documents = f.read()

In [4]:
# Drop the header row (the first 36 characters), strip quotes and newlines,
# and turn row breaks (\r) into tabs so that every field is tab-separated
documents = documents[36:]
documents = documents.replace('\n', '')
documents = documents.replace(' "', '')
documents = documents.replace('"', '')
documents = documents.replace('\r', '\t')
documents = documents.split('\t')

In [5]:
# Create an empty dataframe
index = range(0,12050)
df = pd.DataFrame(index = index, columns=['一级分类', '二级分类','三级分类','四级分类','五级分类','知识ID','知识标题'])
df.head()

Unnamed: 0,一级分类,二级分类,三级分类,四级分类,五级分类,知识ID,知识标题
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,,,,,,,
4,,,,,,,


In [6]:
# Fill the dataframe cell by cell: a top-level category starts a new row,
# and every other element goes into the next column of the current row

top_level = {'孕期', '全龄', '原创栏目', '备孕', '育儿'}
a = 0
b = 0
for word in documents:
    if word in top_level:
        b = 0
        a += 1
    else:
        b += 1
    df.iloc[a, b] = word
   

KeyboardInterrupt: 
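
Assigning cells one at a time with `df.iloc` is slow for roughly 12,000 rows, which is probably why the cell above was interrupted. A possible alternative, sketched here on a toy token list, collects each record in a plain Python list and builds the dataframe once at the end. The `build_rows` helper and the `tokens` values are hypothetical; rows longer than seven fields are truncated in this sketch, which is an assumption rather than a property of the data.

```python
import pandas as pd

COLUMNS = ['一级分类', '二级分类', '三级分类', '四级分类', '五级分类', '知识ID', '知识标题']
TOP_LEVEL = {'孕期', '全龄', '原创栏目', '备孕', '育儿'}

def build_rows(tokens, top_level=TOP_LEVEL, n_cols=len(COLUMNS)):
    """Start a new row at every top-level category; pad or trim rows to n_cols."""
    rows = []
    for token in tokens:
        if token in top_level or not rows:
            rows.append([])
        rows[-1].append(token)
    return [(row + [None] * n_cols)[:n_cols] for row in rows]

# Hypothetical tokens, for illustration only
tokens = ['孕期', 'a', 'b', 'c', 'd', '1', 'title one',
          '育儿', 'e', 'f', 'g', 'h', '2', 'title two']
toy_df = pd.DataFrame(build_rows(tokens), columns=COLUMNS)
print(toy_df.shape)
```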

In [None]:
# Delete the row without data
df = df.drop(df.index[0])
df = df.drop(df.index[12043:])
df.head()
df.tail()

In [None]:
# Create a word list of all titles
wordlist = []
for i in df.index:
    title = df.loc[i, '知识标题']
    wordlist.append(title)
wordlist = ' '.join(wordlist)

In [None]:
# Chinese words segmentation using jieba
seg_list = jieba.cut(wordlist)
seg_list = ' '.join(seg_list)
seg_list = seg_list.split()

In [None]:
# Remove punctuation
punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', 
               '，', '。', '！', '“', '”', '￥', "‘" , "’", '（',
               '）', '；', '：', '——', '《', '》', '？', '【',
               '】', '—', '……', '「', '」', '、', '＆', '＋',
               '－', '５', '＝', '＞',  'Ｎ', '｜', '～']

for i,doc in enumerate(seg_list): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    seg_list[i] = doc

print(len(seg_list))
print(' '.join(seg_list)[:1000])

In [None]:
# Remove numbers
for i,doc in enumerate(seg_list): 
    for num in range(10):
        doc = doc.replace(str(num), '')
    seg_list[i] = doc
    
print(len(seg_list))
print(' '.join(seg_list)[:1000])

In [None]:
# Import Chinese stopwords
with codecs.open('chinese_stopwords_list_text.txt', 'rb', encoding = 'utf-8') as f:
    stopwords = f.read()
stopwords = stopwords[1:]
stopwords = stopwords.split('\n')



In [None]:
# Remove stop words: every element of seg_list is a single token,
# so compare whole tokens against the stop word list
stopword_set = {stopword.strip() for stopword in stopwords}
seg_list = [doc for doc in seg_list if doc.strip() not in stopword_set]

print(len(seg_list))
print(' '.join(seg_list)[:1000])

In [None]:
seg_list = ' '.join(seg_list)
seg_list = seg_list.split()

In [None]:
# Get vocabulary
def get_vocabulary(seg_list):
    # unique words, sorted so that the column order is reproducible
    return sorted(set(seg_list))

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(seg_list)
print(len(vocabulary))

In [None]:
# Create overlapping 100-word chunks
def flatten_and_overlap(seg_list, window_size=100, overlap=25):
    # note: `overlap` is the step between window starts, so consecutive
    # chunks share window_size - overlap words (75 with the defaults)
    
    # create the list of overlapping documents
    new_list_of_documents = []

    # slide a window of window_size words over the word list
    high = window_size
    while high <= len(seg_list):
        low = high - window_size
        new_list_of_documents.append(seg_list[low:high])
        high += overlap
    return new_list_of_documents

chunks = flatten_and_overlap(seg_list)

In [None]:
import numpy as np

# 4) Putting it together: create a function that takes a list of documents
# and a vocabulary as arguments, and returns a dataframe with the counts
# of words: 
def docs_by_words_df(chunks, vocabulary):
    vector_df = pd.DataFrame(0, index=np.arange(len(chunks)), columns=vocabulary)
    
    # fill out the matrix with counts
    for i,chunk in enumerate(chunks):
        for word in chunk:
            if word in vector_df.columns: 
                vector_df.loc[i,word] += 1
            
    return vector_df

# call the function and check that the resulting dataframe is correct
vector_df = docs_by_words_df(chunks, vocabulary)
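
Updating one cell at a time with `.loc` can be slow for a matrix of this size. As a rough alternative, assuming the same list-of-chunks and vocabulary structures as above, the counts can be written into a NumPy array first and wrapped in a dataframe afterwards. The function name `docs_by_words_fast` and the toy values are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

def docs_by_words_fast(chunks, vocabulary):
    """Count vocabulary words per chunk in a NumPy array, then wrap it in a dataframe."""
    col = {word: j for j, word in enumerate(vocabulary)}
    counts = np.zeros((len(chunks), len(vocabulary)), dtype=int)
    for i, chunk in enumerate(chunks):
        for word, n in Counter(chunk).items():
            if word in col:
                counts[i, col[word]] = n
    return pd.DataFrame(counts, columns=vocabulary)

# Hypothetical toy chunks, for illustration only
toy_chunks = [['宝宝', '睡觉', '宝宝'], ['孕期', '饮食']]
toy_vocab = sorted({w for chunk in toy_chunks for w in chunk})
toy_counts = docs_by_words_fast(toy_chunks, toy_vocab)
print(toy_counts)
```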



In [None]:
import math

# 5) create a function that returns 1 + log(cell)
# IF the value in the cell is not zero, and 0 otherwise
def one_plus_log(cell):
    if cell != 0: 
        return 1 + math.log(cell)
    else:
        return 0

In [None]:
# 6) use the "applymap" function of the dataframe to apply the function 
# above to each cell of the table
df_log = vector_df.applymap(one_plus_log)
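
The same `1 + log(count)` scaling can also be applied without a per-cell Python call. The snippet below is a sketch on a hypothetical toy count table; it masks the zeros before taking the logarithm and restores them afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts, for illustration only
toy_counts = pd.DataFrame({'宝宝': [0, 1, 3], '睡觉': [2, 0, 1]})

# Replace zeros with NaN so the log is only taken of positive counts,
# then put the zeros back
positive = toy_counts.where(toy_counts > 0)
toy_log = (1 + np.log(positive)).fillna(0.0)
print(toy_log)
```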

In [None]:
# 12) since we are working with vectors, apply the Normalizer from 
# sklearn.preprocessing to our dataframe. Print a few values 
# before and after to make sure you've applied the normalization
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
df_log[df_log.columns] = scaler.fit_transform(df_log[df_log.columns])
df_log[df_log.columns[500:600]]

In [None]:
# 17) put the code above in a function that takes in a dataframe as an argument
# and computes deviation vectors of each row (=document)
import numpy as np

def vector_length(u):
    return np.sqrt(np.dot(u, u))

def length_norm(u): 
    return u / vector_length(u)

def transform_deviation_vectors(vector_df):
    
    # work on a float copy of the values so the input dataframe is not modified
    matrix = vector_df.values.astype(float)
    
    # compute the sum of the vectors
    v_sum = np.sum(matrix, axis=0)
    
    # normalize this vector (it points in the direction of the average document)
    v_avg = length_norm(v_sum)
    
    # we iterate through each vector
    for row in range(matrix.shape[0]):
        
        # this is one vector (row)
        v_i = matrix[row,:]
        
        # we subtract its component along v_avg
        scalar = np.dot(v_i,v_avg)
        sub = v_avg * scalar
        
        # we replace the row by the deviation vector
        matrix[row,:] = length_norm(v_i - sub)
    
    # return the result as a new dataframe with the same labels
    return pd.DataFrame(matrix, index=vector_df.index, columns=vector_df.columns)

vector_df = transform_deviation_vectors(df_log)
vector_df
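
For reference, the row-by-row loop above can be written as a few array operations. This is a sketch on hypothetical toy data and has not been checked against the real matrix. `deviation_vectors` is an illustrative name, and rows that coincide with the average direction are left as zeros here to avoid dividing by zero, which is an assumption about how such rows should be handled.

```python
import numpy as np

def deviation_vectors(matrix):
    """Remove each row's component along the mean direction, then length-normalize."""
    matrix = np.asarray(matrix, dtype=float)
    v_sum = matrix.sum(axis=0)
    v_avg = v_sum / np.linalg.norm(v_sum)

    # Projection of every row onto v_avg, subtracted in one step
    residual = matrix - np.outer(matrix @ v_avg, v_avg)

    # Normalize each row, skipping rows whose residual is (numerically) zero
    norms = np.linalg.norm(residual, axis=1, keepdims=True)
    return np.divide(residual, norms, out=np.zeros_like(residual), where=norms > 1e-12)

# Hypothetical toy matrix, for illustration only
toy = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [1.0, 1.0, 0.0]])
dev = deviation_vectors(toy)
print(np.linalg.norm(dev, axis=1))
```

Each resulting row has unit length and is orthogonal to the average direction, which is what the loop version is meant to produce.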

In [None]:
# 1a) create a list of inertia values for k 1-10
from sklearn.cluster import KMeans

ks = list(range(1, 11))
inertias = []

for k in ks:
    
    # Create a KMeans instance with k clusters: model
    kmeans = KMeans(n_clusters=k, max_iter=1000)
    
    # Fit model to samples
    kmeans.fit(vector_df.values)
    
    # Append the inertia to the list of inertias
    inertias.append(kmeans.inertia_)

In [None]:
# 1b) plot the inertia values using matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
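
The elbow in an inertia plot can be hard to read, so a silhouette score is sometimes used as a second opinion when choosing k. The sketch below assumes scikit-learn is available and runs on hypothetical, well-separated toy data; it is not a result for the title data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# The silhouette score needs at least two clusters, so k starts at 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```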

In [None]:
# 3a) apply k-means to our data with k=10 and print the 20 words
# that are the most associated with each cluster centroid
# Hint: look at the cluster_centers_ of the KMeans object to find the centroids
import collections
kmeans_obj = KMeans(n_clusters=10, max_iter=1000).fit(vector_df.values)

n_words = 20
top_words = collections.defaultdict(list)

# iterate through each cluster
for n in range(kmeans_obj.n_clusters):

    print('CLUSTER ' + str(n+1) + ': ', end='')

    # get the cluster centers
    arr = kmeans_obj.cluster_centers_[n]

    # sort the array and keep the indices of the n_words largest values
    indices = arr.argsort()[-n_words:]

    # add the words to the list of words
    for i in indices:
        print(vocabulary[i], end=', ')
        top_words[n].append(vocabulary[i])
        
    print('')

In [None]:
#4a) plot the dendrogram using the link above (method = complete)
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(vector_df.values, method='complete')

# Plot the dendrogram, using the row (document) index as labels;
# there must be one label per row, not one per vocabulary word
dendrogram(mergings,
           labels=vector_df.index.tolist(),
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()

In [None]:
#4c) we are going to use agglomerative clustering here 
# from the sklearn library 
from sklearn.cluster import AgglomerativeClustering

ward = AgglomerativeClustering(n_clusters=10, linkage='ward').fit(vector_df.values)
label = ward.labels_

print("Number of points: " + str(label.size))

In [None]:
# 4d) compute the center of the cluster
# unfortunately sklearn doesn't provide you with the centroids, but you can use the link below:
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html
from sklearn.neighbors import NearestCentroid
import numpy as np

clf = NearestCentroid()
clf.fit(vector_df.values, label)

print(clf.centroids_.shape)

In [None]:
# 4e) print the top 20 words for each cluster centroid
def visualize_clusters(vector_df, n_clusters, centroids, n_words=20, printed=True):   
    # try to get the most informative words of each cluster
    words = {}
    vocabulary = vector_df.columns
    for n in range(n_clusters):
        words[n] = []
        if printed: print('CLUSTER ' + str(n+1) + ': ', end='')
        arr = centroids[n]
        indices = arr.argsort()[-n_words:]
        for i in indices:
            if printed: print(vocabulary[i], end=', ')
            words[n].append(vocabulary[i])
        if printed: print('')
    return words

top_words = visualize_clusters(vector_df, clf.centroids_.shape[0], clf.centroids_)
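
The framing section also names DBSCAN as a planned method, which the cells above do not yet cover. A minimal sketch, assuming scikit-learn and cosine distance on length-normalized vectors, could look like the following. The toy vectors and the `eps` and `min_samples` values are hypothetical and would need tuning on the real matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups of directions plus one outlier
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0, 0.0]) + rng.normal(scale=0.01, size=(10, 3))
group_b = np.array([0.0, 1.0, 0.0]) + rng.normal(scale=0.01, size=(10, 3))
outlier = np.array([[0.0, 0.0, 1.0]])
toy = np.vstack([group_a, group_b, outlier])

# eps is a cosine-distance radius; points with too few neighbours are labelled -1 (noise)
db = DBSCAN(eps=0.05, min_samples=3, metric='cosine').fit(toy)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, list(db.labels_).count(-1))
```

Unlike k-means, DBSCAN does not take the number of clusters as input, so the count it returns depends entirely on `eps` and `min_samples`.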