# Final Project

## Framing

**Introduction**: describe your dataset, and why you're interested in it

The dataset for this study is a list of article titles from a self-media company in China. More than ten thousand titles were retrieved online, along with the draft category assigned to each title.
I am interested in this dataset because the company is the largest self-media source in China and mainly publishes short articles on pregnancy and early child-rearing. By clustering these titles, I want to explore the most discussed topics in this area and validate the draft categories designed by the company.

**Research question(s)**: describe the overall research question of your project

What are the most discussed topics in pregnancy and early child-rearing, as revealed by clustering the article titles?

Can the clusters of titles validate the company's categorization of the topics?

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypothesis should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypothesis should be formulated as an affirmation (and not a question)
    * You can also describe alternative hypotheses, if you think that your results could go either way (but again, have a rationale as for why)
    
    The clusters of titles will reflect the most discussed topics.
    The clusters will partially validate the company's categorization, but more data, such as the full content of each article, will be needed to validate it accurately.

**Results**:
    * how are you planning to test each hypothesis? What models are you thinking of using?
    * what are the best results you can hope for? Is that interesting / relevant for other researchers?
    * what are the implications of your potential findings for practitioners?
    
    K-means and DBSCAN will be the primary methods in this study.
    The best result would be a correct validation of the categorization. More interestingly, if the clusters reflect the frequency of each category, I can conclude which category occurs most often.
    Consequently, the frequency of each category can be reported to the company so that it can write more articles on topics that interest parents but have not been covered frequently.
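
One possible way to quantify how well a clustering agrees with the draft categories is cluster purity: the share of titles whose category matches the majority category of their cluster. The sketch below is an assumption about how the validation could be scored, not a method stated in this plan; the function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_labels, category_labels):
    """Share of items whose category equals the majority category of their cluster."""
    if len(cluster_labels) != len(category_labels):
        raise ValueError("label lists must have the same length")
    if not cluster_labels:
        raise ValueError("label lists must not be empty")
    # Count how often each category appears inside each cluster
    by_cluster = defaultdict(Counter)
    for cluster, category in zip(cluster_labels, category_labels):
        by_cluster[cluster][category] += 1
    # Sum the size of the majority category in every cluster
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(cluster_labels)

# Hypothetical toy labels (not taken from the dataset)
clusters = [0, 0, 0, 1, 1, 2]
categories = ['孕期', '孕期', '育儿', '育儿', '育儿', '备孕']
purity = cluster_purity(clusters, categories)
print(round(purity, 3))
```

Purity rises trivially as the number of clusters grows, so it is best compared across runs that use the same number of clusters.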

**Threats**
    * Describe issues that might arise during the analyses above
    * Come up with backup plans in case you run into these issues
    

## Data Exploration

Describe your raw data below; provide definition / explanations for the measures you're using

## Data Cleaning

Clean your data in this section, and make sure it's ready to be analyzed for next week!

In [1]:
# Import libraries
import jieba   #Chinese word segmentation tool
import pandas as pd
import os
import codecs
import csv
import re

In [2]:
os.chdir('/Users/peizhiwen/Desktop/S435 Final Project')
os.getcwd()

'/Users/peizhiwen/Desktop/S435 Final Project'

In [3]:
# Import data
# The data was originally in Excel and was exported to a UTF-16 text file so that it can be imported into Python
with codecs.open('Content.txt', 'rb', encoding = 'utf-16') as f:
    documents = f.read()

In [None]:
# Drop the header row (the first 36 characters) and normalize all separators to \t.
# documents is a single string, so the replacements are applied to the whole
# string; a per-character loop could never match the two-character pattern ' "'.
documents = documents[36:]
documents = documents.replace('\n', '')
documents = documents.replace(' "', '')
documents = documents.replace('"', '')
documents = documents.replace('\r', '\t')
documents = documents.split('\t')
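
As an alternative to manual string replacement, the standard-library `csv` module can split a tab-separated export and handle quoted fields in one pass. The sketch below assumes the export is tab-delimited with `\r\n` row endings and the same number of fields in every row, which may not hold for `Content.txt`; the `sample` string is hypothetical and only stands in for the real file.

```python
import csv
import io

# Hypothetical sample standing in for the exported text (tab-separated, CRLF rows)
sample = '一级分类\t知识ID\t知识标题\r\n孕期\t101\t"孕早期 饮食"\r\n育儿\t102\t宝宝辅食\r\n'

# newline='' leaves line endings untouched so csv.reader can interpret them itself
reader = csv.reader(io.StringIO(sample, newline=''), delimiter='\t', quotechar='"')
header = next(reader)           # first row holds the column names
rows = [row for row in reader]  # remaining rows hold the data

print(header)
print(rows)
```

If the rows in the real file are ragged, for example because deeper category levels are omitted, this approach would need an extra step to align the fields.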

In [None]:
# Create an empty dataframe
# Columns: category levels 1 to 5, knowledge ID, knowledge title
index = range(0,12050)
df = pd.DataFrame(index = index, columns=['一级分类', '二级分类','三级分类','四级分类','五级分类','知识ID','知识标题'])
df.head()

In [None]:
# Fill the dataframe cell by cell with the elements of documents.
# A top-level category value marks the start of a new row.
top_level = {'孕期', '全龄', '原创栏目', '备孕', '育儿'}

a = 0  # current row
b = 0  # current column
for word in documents:
    if word in top_level:
        b = 0
        a += 1
    else:
        b += 1
    df.iloc[a, b] = word
   

In [None]:
# Delete the row without data
df = df.drop(df.index[0])
df = df.drop(df.index[12043:])
df.head()
df.tail()

In [None]:
# Create a word list of all titles
wordlist = []
for i in df.index:
    title = df.loc[i, '知识标题']
    wordlist.append(title)
wordlist = ' '.join(wordlist)

In [None]:
# Chinese words segmentation using jieba
seg_list = jieba.cut(wordlist)
seg_list = ' '.join(seg_list)
seg_list = seg_list.split()

In [None]:
# Remove punctuation (each mark is replaced with a space)
punctuation = ['.', '...', '!', '#', '"', '%', '$', "'", '&', ')', 
               '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', 
               '<', '?', '>', '@', '",', '".', '[', ']', '\\', ',',
               '_', '^', '`', '{', '}', '|', '~', '−', 
               '，', '。', '！', '“', '”', '￥', "‘" , "’", '（',
               '）', '；', '：', '——', '《', '》', '？', '【',
               '】', '—', '……', '「', '」', '、', '＆', '＋',
               '－', '５', '＝', '＞',  'Ｎ', '｜', '～']

for i,doc in enumerate(seg_list): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    seg_list[i] = doc

print(len(seg_list))
print(' '.join(seg_list)[:1000])

In [None]:
# Remove numbers
for i,doc in enumerate(seg_list): 
    for num in range(10):
        doc = doc.replace(str(num), '')
    seg_list[i] = doc
    
print(len(seg_list))
print(' '.join(seg_list)[:1000])
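
The punctuation and digit removal above could also be done in one pass with a regular expression. This is a minimal sketch, not a drop-in replacement: it deletes the matched characters instead of replacing them with spaces, and it keeps full-width letters such as 'Ｎ' that the punctuation list above removes. The names `NON_WORD` and `clean_tokens` and the toy tokens are hypothetical.

```python
import re

# Matches runs of non-word characters, underscores, and digits.
# For str patterns in Python 3, \w and \d are Unicode-aware, so Chinese
# characters are kept while full-width punctuation and digits are removed.
NON_WORD = re.compile(r'[\W_\d]+')

def clean_tokens(tokens):
    """Strip punctuation and digits from each token and drop tokens left empty."""
    cleaned = (NON_WORD.sub('', tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Hypothetical tokens, similar in shape to jieba output
tokens = ['孕期', '，', '3', '个', '月', '！', 'B超', '……']
print(clean_tokens(tokens))
```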

In [None]:
# Import Chinese stopwords
with codecs.open('chinese_stopwords_list_text.txt', 'rb', encoding = 'utf-8') as f:
    stopwords = f.read()
stopwords = stopwords[1:]
stopwords = stopwords.split('\n')



In [None]:
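# Remove stop words
# Each element of seg_list is a single token without surrounding spaces, so
# searching for ' ' + stopword + ' ' inside a token never matches; compare the
# stripped token against a set of stop words instead.
stopword_set = {stopword.strip() for stopword in stopwords}
seg_list = [doc for doc in seg_list if doc.strip() not in stopword_set]

print(len(seg_list))
print(' '.join(seg_list)[:1000])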

In [None]:
seg_list = ' '.join(seg_list)
seg_list = seg_list.split()

In [None]:
# Get vocabulary
def get_vocabulary(seg_list):
    # Unique words in sorted order
    return sorted(set(seg_list))

# Then print the length of your vocabulary (it should be 
# around 5500 words)
vocabulary = get_vocabulary(seg_list)
print(len(vocabulary))

In [None]:
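# Create overlapping 100-word chunks
def flatten_and_overlap(seg_list, window_size=100, overlap=25):
    # Note: `overlap` is the step between chunk starts, so with the defaults
    # consecutive chunks share window_size - overlap = 75 words.

    # create the list of overlapping documents
    new_list_of_documents = []

    # slide a window of window_size words over the token list
    high = window_size
    while high <= len(seg_list):  # <= keeps a chunk that ends on the last word
        low = high - window_size
        new_list_of_documents.append(seg_list[low:high])
        high += overlap
    return new_list_of_documents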

chunks = flatten_and_overlap(seg_list)
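
Before clustering, each chunk has to be turned into a numeric vector. The sketch below illustrates the sliding-window idea on a toy token list and converts each window into a word-count vector over the vocabulary. Count vectors are only one possible representation and are an assumption here; the helper names `sliding_windows` and `count_vector` and the toy tokens are hypothetical.

```python
from collections import Counter

def sliding_windows(tokens, window_size, step):
    """Return windows of window_size tokens, each starting step tokens after the previous one."""
    return [tokens[low:low + window_size]
            for low in range(0, len(tokens) - window_size + 1, step)]

def count_vector(chunk, vocabulary):
    """Count how often each vocabulary word occurs in the chunk."""
    counts = Counter(chunk)
    return [counts[word] for word in vocabulary]

# Hypothetical toy tokens standing in for the segmented titles
tokens = ['a', 'b', 'c', 'a', 'd', 'b', 'a']
windows = sliding_windows(tokens, window_size=4, step=2)
vocab = sorted(set(tokens))
vectors = [count_vector(w, vocab) for w in windows]

print(windows)
print(vectors)
```

Each row of `vectors` has one entry per vocabulary word, so the rows can be stacked into a matrix and passed to a clustering algorithm.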