# Final Project

## Framing

**Introduction**: describe your dataset, and why you're interested in it

The dataset for this study is a list of article titles from a self-media company in China. More than ten thousand titles were retrieved online, together with the draft categories assigned to them.
I am interested in this dataset because the company is the largest self-media source in China and mainly publishes short articles on pregnancy and early child-rearing. By clustering the titles, I want to explore the most discussed topics in this area and validate the draft categories designed by the company.

**Research question(s)**: describe the overall research question of your project

What are the most discussed topics about pregnancy and early child-rearing, as revealed by clustering the article titles?

Can the clusters of titles validate the topic categorization designed by the self-media company?

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypothesis should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypothesis should be formulated as an affirmation (and not a question)
    * You can also describe alternative hypotheses, if you think that your results could go either way (but again, have a rationale as for why)
    
    The clusters can reflect the most discussed topics.
    The clusters can validate the categorization, but validating it accurately requires more data, such as the content of each article.

**Results**:
    * how are you planning to test each hypothesis? What models are you thinking of using?
    * what are the best results you can hope for? Is that interesting / relevant for other researchers?
    * what are the implications of your potential findings for practitioners?
    
    K-means and DBSCAN will be the primary methods I use in this study.
    The best result would be a correct validation of the categorization. More interestingly, if the clusters reflect the frequency of each category, I can determine which category occurs most often.
    Consequently, the frequency of each category can be reported to the company so that it can write more articles on topics that interest parents but have not been covered frequently.
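
One way to make the validation step concrete is cluster purity: for each cluster, find the company category that occurs most often in it, then report the share of titles that fall in their cluster's majority category. The sketch below is only an illustration under stated assumptions. The function `cluster_purity` and the toy lists `cluster_labels` and `company_categories` are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_labels, categories):
    """Return (overall purity, majority category per cluster)."""
    # Count how often each category appears inside each cluster
    by_cluster = defaultdict(Counter)
    for label, category in zip(cluster_labels, categories):
        by_cluster[label][category] += 1

    # For each cluster keep the most frequent category and its count
    majority = {label: counts.most_common(1)[0]
                for label, counts in by_cluster.items()}

    # Purity = titles that sit in their cluster's majority category / all titles
    matched = sum(count for _, count in majority.values())
    purity = matched / len(cluster_labels)
    return purity, {label: category for label, (category, _) in majority.items()}

# Toy data (hypothetical): six titles, two clusters, three company categories
cluster_labels = [1, 1, 1, 2, 2, 2]
company_categories = ['孕期', '孕期', '育儿', '育儿', '育儿', '备孕']
purity, majority = cluster_purity(cluster_labels, company_categories)
print(purity, majority)
```

A purity close to 1 would suggest the clusters agree with the company's categories, while a low value would suggest that the two groupings cut the titles differently.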

**Threats**
    * Describe issues that might arise during the analyses above
    * Come up with backup plans in case you run into these issues
    

## Data Exploration

Describe your raw data below; provide definition / explanations for the measures you're using

## Data Cleaning

Clean your data in this section, and make sure it's ready to be analyzed for next week!

In [None]:
# Preview of all functions

#Import data
documents = import_data('Content.txt')

# Create a dataframe that is exact same as the txt
df = create_dataframe(documents)

# Clean the dataframe and create a list of titles ready to be used
wordlists = clean_data_create_wordlists(df)

# Create a seg_list with all words, a vocabulary list, and a dataframe with vocabulary and frequency of occurrence
seg_list, vocabulary, vocab_freq_df = create_seg_list_and_vocabulary(wordlists)

# Create 100-word chunks
chunks = flatten_and_overlap(seg_list)

# Returns a dataframe with the counts of words
vector_df = docs_by_words_df(chunks, vocabulary)

# Replace each nonzero count c with 1 + log(c); zero counts stay zero
df_log = vector_df.applymap(one_plus_log)

# Apply L2 normalization to each row
df_log = vector_normalizer(df_log)

# Compute deviation vectors of each row
vector_df = transform_deviation_vectors(df_log)

# Apply agglomerative clustering here 
ward, clf = apply_agglomertive(vector_df, 10)

# Print the top words (30 by default) for each cluster centroid
top_words = topwords_clusters(vector_df, clf.centroids_.shape[0], clf.centroids_)

# Create a dataframe of the clustering results
master_df = create_clustering_df(vector_df, chunks, ward, wordlists, df)

# Create article count dataframe for data visualization
countid = numarticle_count_df(master_df)

# Data Visualization

In [1]:
# Import libraries
import jieba   #Chinese word segmentation tool
import pandas as pd
import os
import codecs
import csv
import re
import numpy as np
import matplotlib.pyplot as plt
import math

In [None]:
# Import data
def import_data(filename):

    # Import data
    with codecs.open('./'+str(filename), 'rb', encoding = 'utf-16') as f:
        documents = f.read()

    # Delete the column names (the first 36 characters), strip newlines and
    # double quotes, and turn carriage returns into tabs so every field is tab-separated
    documents = documents[36:]
    documents = documents.replace('\n', '').replace('"', '').replace('\r', '\t')
    documents = documents.split('\t')

    return documents

documents = import_data('Content.txt')

In [3]:
# Create a dataframe that is exact same as the txt
def create_dataframe(documents):

    # Create an empty dataframe. Columns: category levels 1 to 5 (一级分类 to 五级分类),
    # knowledge ID (知识ID), and knowledge title (知识标题)
    index = range(0,12043)
    df = pd.DataFrame(index = index, columns=['一级分类', '二级分类','三级分类','四级分类','五级分类','知识ID','知识标题'])

    # Fill the dataframe cell by cell. A top-level category value (pregnancy, all ages,
    # original column, preparing for pregnancy, parenting) marks the start of a new row,
    # so row 0 stays empty and is dropped during cleaning.
    top_level = {'孕期', '全龄', '原创栏目', '备孕', '育儿'}
    row = 0
    col = 0
    for word in documents:
        if word in top_level:
            col = 0
            row += 1
        else:
            col += 1
        df.iloc[row, col] = word
   
    return df

df = create_dataframe(documents)

In [4]:
# Clean the dataframe and create a list of titles ready to be used
def clean_data_create_wordlists(df):

    # Delete the empty first row; the second drop removes any rows from position 12043 onward
    df = df.drop(df.index[0])
    df = df.drop(df.index[12043:])

    # Create a list of all titles
    wordlists = df['知识标题'].tolist()

    # Clean punctuation
    punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
                   '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
                   '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
                   '_', '^', '`', '{', '}', '|', '~', '−', 
                   '，', '。', '！', '“', '”', '￥', "‘" , "’", '（',
                   '）', '；', '：', '——', '《', '》', '？', '【',
                   '】', '—', '……', '「', '」', '、', '＆', '＋',
                   '－', '５', '＝', '＞',  'Ｎ', '｜', '～']

    with codecs.open('chinese_stopwords_list_text.txt', 'rb', encoding = 'utf-8') as f:
        stopwords = f.read()
    stopwords = stopwords[1:]
    stopwords = stopwords.split('\n')

    # Chinese words segmentation using jieba
    for i,topic in enumerate(wordlists):
        topic = jieba.cut(topic)
        topic_joined = ' '.join(topic)
        topic_split = topic_joined.split()
        wordlists[i] = topic_split

    # Delete numbers, punctuation, and stopwords
    for i,topic in enumerate(wordlists):
        for n, word in enumerate(topic):
            for num in range(100):
                word = word.replace(str(num), '')
            for punc in punctuation:
                word = word.replace(punc, '')
            if word in stopwords:
                word = ''
            topic[n] = word
        topic = ' '.join(topic).split()
        wordlists[i] = topic
        
    return wordlists

wordlists = clean_data_create_wordlists(df)
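
The token clean-up inside the function above (strip digits, strip punctuation, drop stopwords) can also be expressed with regular expressions. The following is a minimal sketch on tokens that are assumed to be already segmented, so it does not call jieba. The helper `clean_tokens` and the tiny stopword and punctuation lists are hypothetical stand-ins for the full lists used in the notebook.

```python
import re

def clean_tokens(tokens, stopwords, punctuation):
    """Strip ASCII digits and the listed punctuation; drop stopwords and empty tokens."""
    # Escape every mark so characters such as '.' or '(' are matched literally;
    # longer marks come first so '……' is tried before any shorter mark
    marks = sorted(punctuation, key=len, reverse=True)
    punct_pattern = re.compile('|'.join(re.escape(mark) for mark in marks))

    cleaned = []
    for token in tokens:
        token = re.sub(r'[0-9]', '', token)   # remove ASCII digits
        token = punct_pattern.sub('', token)  # remove punctuation marks
        if token and token not in stopwords:  # skip empties and stopwords
            cleaned.append(token)
    return cleaned

# Toy title (hypothetical), already split into tokens
tokens = ['宝宝', '3', '个', '月', '，', '怎么', '喂养', '？']
result = clean_tokens(tokens, stopwords={'个', '怎么'}, punctuation=['，', '？', '……'])
print(result)
```

Passing the stopwords as a set keeps each membership test fast, which may matter with more than ten thousand titles.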

In [None]:
# Create a seg_list with all words, a vocabulary list, and a dataframe with vocabulary and frequency of occurrence
def create_seg_list_and_vocabulary(wordlists):
    
    # Create seg_list with all words in wordlists
    seg_list = []
    for wordlist in wordlists:
        for word in wordlist:
            seg_list.append(word)

    # Get the sorted vocabulary (the unique words in seg_list)
    vocabulary = sorted(set(seg_list))

    # Then print the length of the vocabulary
    print('The length of vocabulary is', len(vocabulary))

    # Create vocabulary frequency dataframe (one pass over seg_list with Counter)
    from collections import Counter
    counts = Counter(seg_list)
    vocab_freq_df = pd.DataFrame({'vocabulary': vocabulary,
                                  'count': [counts[word] for word in vocabulary]})

    # Sort vocab_freq_df
    vocab_freq_df = vocab_freq_df.sort_values('count', ascending = False)
    
    return seg_list, vocabulary, vocab_freq_df

seg_list, vocabulary, vocab_freq_df = create_seg_list_and_vocabulary(wordlists)

In [None]:
# Create 100-word chunks
def flatten_and_overlap(seg_list, window_size=100, overlap=25):
    
    # create the list of overlapping documents
    new_list_of_documents = []

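    # slide a window of window_size words, advancing by `overlap` words each step,
    # so neighbouring chunks share window_size - overlap words; a window that would
    # end exactly at the last word is skipped by the `<` test below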
    high = window_size
    while high < len(seg_list):
        low = high - window_size
        new_list_of_documents.append(seg_list[low:high])
        high += overlap
    return new_list_of_documents

chunks = flatten_and_overlap(seg_list)
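
In `flatten_and_overlap`, the `overlap` argument is the amount added to `high` on each iteration, so it acts as the step between windows: with the defaults, neighbouring chunks share 100 - 25 = 75 words. The sketch below shows the same sliding-window idea on a short list. The name `sliding_windows` and its `step` parameter are hypothetical, and unlike the notebook's loop it also keeps a window that ends exactly at the last item.

```python
def sliding_windows(items, window_size, step):
    """Return every full window of `window_size` items, advancing by `step`."""
    last_start = len(items) - window_size
    return [items[start:start + window_size]
            for start in range(0, last_start + 1, step)]

# Ten items, windows of four, step of two: consecutive windows share two items
windows = sliding_windows(list(range(10)), window_size=4, step=2)
for window in windows:
    print(window)
```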

In [None]:
# Returns a dataframe with the counts of words
def docs_by_words_df(chunks, vocabulary):
    vector_df = pd.DataFrame(0, index=np.arange(len(chunks)), columns=vocabulary)
    
    # fill out the matrix with counts
    for i,chunk in enumerate(chunks):
        for word in chunk:
            if word in vector_df.columns: 
                vector_df.loc[i,word] += 1
            
    return vector_df

vector_df = docs_by_words_df(chunks, vocabulary)

In [None]:
# Replace each nonzero count c with 1 + log(c); zero counts stay zero
def one_plus_log(cell):
    if cell != 0: 
        return 1 + math.log(cell)
    else:
        return 0

# Apply the function above to each cell of the table
df_log = vector_df.applymap(one_plus_log)
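
The effect of this transformation is to dampen high counts: a word that appears ten times in a chunk ends up with a weight of about 3.3 rather than 10. A small sketch of the same formula on plain numbers follows. The name `sublinear_weight` is hypothetical and is used only to avoid redefining the notebook's `one_plus_log`.

```python
import math

def sublinear_weight(count):
    """Return 1 + ln(count) for a nonzero count, and 0 for a zero count."""
    return 1 + math.log(count) if count != 0 else 0

# Weights for a few raw counts
counts = (0, 1, 2, 10)
weights = [sublinear_weight(c) for c in counts]
for count, weight in zip(counts, weights):
    print(count, round(weight, 3))
```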

In [None]:
# Apply L2 normalization to each row
def vector_normalizer(df_log):

    from sklearn.preprocessing import Normalizer

    scaler = Normalizer()
    df_log[df_log.columns] = scaler.fit_transform(df_log[df_log.columns])
    
    return df_log

df_log = vector_normalizer(df_log)

In [None]:
# Compute deviation vectors of each row
def vector_length(u):
    return np.sqrt(np.dot(u, u))

def length_norm(u): 
    return u / vector_length(u)

def transform_deviation_vectors(vector_df):
    
    # work on a copy of the numpy matrix so the input dataframe is left untouched
    matrix = vector_df.values.copy()
    
    # compute the sum of the vectors
    v_sum = np.sum(matrix, axis=0)
    
    # scale the sum to unit length; this gives the direction of the average vector
    v_avg = length_norm(v_sum)
    
    # we iterate through each vector
    for row in range(matrix.shape[0]):
        
        # this is one vector (row)
        v_i = matrix[row,:]
        
        # we subtract its component along v_avg
        scalar = np.dot(v_i,v_avg)
        sub = v_avg * scalar
        
        # we replace the row by the deviation vector
        matrix[row,:] = length_norm(v_i - sub)
    
    return pd.DataFrame(matrix, index=vector_df.index, columns=vector_df.columns)

vector_df = transform_deviation_vectors(df_log)
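
A quick way to check the deviation step is to confirm two properties of its output: every row has unit length, and every row is orthogonal to the average direction. The sketch below reproduces the same arithmetic in plain Python on a toy 3x3 matrix. The helper names `dot`, `unit`, and `deviation_vectors` are hypothetical, and the toy rows are not taken from the dataset.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(u):
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

def deviation_vectors(rows):
    """Remove each row's component along the average direction, then rescale to unit length."""
    column_sums = [sum(column) for column in zip(*rows)]
    v_avg = unit(column_sums)
    result = []
    for row in rows:
        scalar = dot(row, v_avg)  # length of the projection onto v_avg
        residual = [a - scalar * b for a, b in zip(row, v_avg)]
        result.append(unit(residual))
    return v_avg, result

# Toy matrix (hypothetical)
rows = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
v_avg, deviations = deviation_vectors(rows)
for deviation in deviations:
    print(round(dot(deviation, v_avg), 12), round(dot(deviation, deviation), 12))
```

A row that points exactly along the average direction would leave a zero residual, and rescaling it would divide by zero, so such rows may need special handling.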

In [None]:
# Apply agglomerative clustering here 

def apply_agglomertive(vector_df, clustersnum):
    
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import NearestCentroid

    ward = AgglomerativeClustering(n_clusters=clustersnum, linkage='ward').fit(vector_df.values)
    label = ward.labels_
    print("Number of points: " + str(label.size))
    
    # compute the center of the cluster
    clf = NearestCentroid()
    clf.fit(vector_df.values, label)

    return ward, clf

ward, clf = apply_agglomertive(vector_df, 10)

In [None]:
# Print the top words (30 by default) for each cluster centroid
def topwords_clusters(vector_df, n_clusters, centroids, n_words=30, printed=True):   
    # try to get the most informative words of each cluster
    words = {}
    vocabulary = vector_df.columns
    for n in range(n_clusters):
        words[n] = []
        if printed: print('CLUSTER ' + str(n+1) + ': ', end='')
        arr = centroids[n]
        # indices of the n_words largest centroid weights, in ascending order
        indices = arr.argsort()[-n_words:]
        for i in indices:
            if printed: print(vocabulary[i], end=', ')
            words[n].append(vocabulary[i])
        if printed: print('')
    return words

top_words = topwords_clusters(vector_df, clf.centroids_.shape[0], clf.centroids_)
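
Because `arr.argsort()[-n_words:]` returns indices in ascending order of weight, the strongest word of each cluster appears last in the list. If strongest-first order is preferred, the ranking can be sketched as follows. The helper `top_n_words` and the toy centroid are hypothetical.

```python
def top_n_words(centroid, vocabulary, n_words):
    """Return the n_words vocabulary entries with the largest centroid weights, strongest first."""
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocabulary[i] for i in order[:n_words]]

# Toy centroid (hypothetical) over a four-word vocabulary
toy_centroid = [0.1, 0.7, 0.0, 0.4]
toy_vocabulary = ['a', 'b', 'c', 'd']
top_two = top_n_words(toy_centroid, toy_vocabulary, n_words=2)
print(top_two)
```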

In [None]:
# Create a dataframe of the clustering results
def create_clustering_df(vector_df, chunks, ward, wordlists, df):

    #Create indices, list_of_chunks, labels, palette
    from bokeh.palettes import Category10

    indices = vector_df.index
    list_of_chunks = []
    for chunk in chunks:
        chunk=' '.join(chunk)
        list_of_chunks.append(chunk)
    labels = ward.labels_ +1 
    palette = []
    for label in labels:
        color = Category10[10][label-1]
        palette.append(color)
    spectopic = dict([(0,'Child general development'),
                (1,'Early rearing and support'),
                (2,'Newborn disease and prevention'),
                (3,'Newborn food and nutrition'),
                (4,'Fetus growth'),
                (5,'Child cognitive and social-emotional development'),
                (6,'Preparation for pregnancy and childbirth'),
                (7,'Disease and physical change during pregnancy'),
                (8,'Newborn growth indicator'),
                (9,'Post-childbirth nutrition and food') ])
    clustername = []
    for label in labels:
        clustern = spectopic[label-1]
        clustername.append(clustern)

    #Create lists of content for dataframe
    知识标题 = []
    一级分类 = []
    二级分类 = []
    三级分类 = []
    四级分类 = []
    知识ID = []
    current_topic = 0
    nextstart_topic = 0
    for i,chunk in enumerate(chunks):
        title = []
        firstc = []
        secondc = []
        thirdc = []
        forthc = []
        titleid = []
        current_topic = nextstart_topic
        n=0
        while n < 100:
            if chunk[n] in wordlists[current_topic]:
                n+=1
                if n == 26:
                    nextstart_topic = current_topic
            else:
                title.append(df.loc[current_topic+1, '知识标题'])
                if df.loc[current_topic+1, '一级分类'] not in firstc:
                    firstc.append(df.loc[current_topic+1, '一级分类'])
                if df.loc[current_topic+1, '二级分类'] not in secondc:
                    secondc.append(df.loc[current_topic+1, '二级分类'])
                if df.loc[current_topic+1, '三级分类'] not in thirdc:
                    thirdc.append(df.loc[current_topic+1, '三级分类'])
                if df.loc[current_topic+1, '四级分类'] not in forthc:
                    forthc.append(df.loc[current_topic+1, '四级分类'])
                titleid.append(df.loc[current_topic+1, '知识ID'])
                current_topic += 1
        知识标题.append(title)
        一级分类.append(firstc)
        二级分类.append(secondc)
        三级分类.append(thirdc)
        四级分类.append(forthc)
        知识ID.append(titleid)

    # Create dataframe
    master = {'indices': indices,
              'chunk': list_of_chunks, 
              'cluster': labels,
              'clustername': clustername,
              'firstca': 一级分类, 
              'secondca': 二级分类, 
              'thirdca': 三级分类, 
              'forthca': 四级分类,
              'titleID': 知识ID,
              'title': 知识标题, 
              'palette': palette }
    master_df = pd.DataFrame(master)
    master_df.head()

    return master_df

master_df = create_clustering_df(vector_df, chunks, ward, wordlists, df)

In [None]:
# Create articles count dataframe

def numarticle_count_df(master_df):

    countid = pd.DataFrame(index = [1,2,3,4,5,6,7,8,9,10], columns=['cluster','idcount','clustername_zh','clustername_en','topwords'])
    clustername_zh = ['儿童总体发展','启蒙教育与辅助','新生儿疾病安全预防与','新生儿事务与营养','胎儿与新生儿生长发育',
                      '儿童认知和社会性情感发展','备孕与生产','孕期症状和变化','新生儿成长指标','产后营养与饮食']
    clustername_en = ['Child general development','Early rearing and support','Newborn illness and prevention',
                      'Newborn food and nutrition','Fetus growth','Child cognitive and social-emotional development',
                      'Preparation for pregnancy and childbirth','Disease and physical change during pregnancy','Newborn growth indicator',
                      'Post-childbirth nutrition and food']

    titlelist = [[ ],[ ],[ ],[ ],[ ],[ ],[ ],[ ],[ ],[ ]]

    for eachindex in master_df.index:
        clusternum = master_df.loc[eachindex, 'cluster']
        titleID = master_df.loc[eachindex, 'titleID']
        for eachid in titleID:
            if eachid not in titlelist[clusternum-1]:
                titlelist[clusternum-1].append(eachid)

    for i, clusterlist in enumerate(titlelist):
        countid.loc[i+1,'cluster'] = i+1
        countid.loc[i+1,'idcount'] = len(clusterlist)
        countid.loc[i+1,'clustername_zh'] = clustername_zh[i]
        countid.loc[i+1,'clustername_en'] = clustername_en[i]
        countid.loc[i+1,'topwords'] = top_words[i]

    return countid

countid = numarticle_count_df(master_df)

In [None]:
#Visualization (English)
from bokeh.plotting import ColumnDataSource, figure,show
from bokeh.io import output_notebook, output_file, curdoc
from bokeh.models import HoverTool, Select, Slider
from bokeh.layouts import row, column, gridplot
from bokeh.models.widgets import DataTable, DateFormatter, TableColumn, PreText

#interactive cluster display
source = ColumnDataSource(master_df)
plot = figure(plot_width = 1200, tools='box_select, lasso_select, pan, box_zoom', x_axis_label = 'Chunk', y_axis_label = 'Speculative topic')
plot.circle('indices', 'cluster', source=source, color = 'palette', hover_fill_color = 'firebrick', hover_alpha= 0.5, hover_line_color='white')

hover = HoverTool(tooltips = [('Original category', '@secondca')], mode = 'vline')
plot.add_tools(hover)

#histogram
histsource = ColumnDataSource(countid)
histplot = figure(title='Distribution of articles', tools='', background_fill_color="#fafafa", x_axis_label = 'Cluster', y_axis_label = 'Number of articles')
histplot.hbar(y='cluster', height=0.5, left=0, right='idcount', color="#CAB2D6", source =histsource)
histhover = HoverTool(tooltips = [('Cluster','@cluster'),('Number of articles', '@idcount'),('Cluster name', '@clustername_en')], mode = 'hline')
histplot.add_tools(histhover)

#table of titles
columns = [
TableColumn(field="cluster", title="Cluster", width=80),
TableColumn(field="clustername_en", title="Cluster name", width=250),
TableColumn(field="idcount", title="Frequency", width=100),
]
data_table = DataTable(source=histsource, columns=columns,width=500)

#table of topwords
columns = [
TableColumn(field="cluster", title="Cluster", width=80),
TableColumn(field="clustername_en", title="Cluster name", width=250),
TableColumn(field="topwords", title="Topwords", width=770),
]
topwords_table = DataTable(source=histsource, columns=columns,width=1100)


#Text
pre = PreText(text="""Babytree titles clustering analysis

""",
width=500, height=100)


#layout and output
layout = gridplot([[pre,None],[topwords_table, None],[histplot,data_table],[plot,None]])
# output_file must be called before show so that the layout is written to this path
output_file('/Users/peizhiwen/Desktop/S435 Final Project/Babytree Visulization.html')
show(layout)