## Randomly pick characters from a character pool to form two- or three-character words, force those words to satisfy Zipf's law, and then compose a text from them

In [1]:
import random 
import bisect 
import math 
from functools import reduce
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
class ZipfGenerator: 
    """
    ZipfGenerator is an immutable type representing a Zipf probability
    mass function with parameters alpha and n.

    Adapted from code copied from the following online resource:
    
    http://stackoverflow.com/questions/1366984/
    generate-random-numbers-distributed-by-zipf/
    8788662#8788662

    """

    
    def __init__(self, n, alpha): 
        """Initialize a Zipf CDF.

        Parameters
        n: int
            number of ranks, n >= 1
        alpha: float
            exponent, alpha >= 1
        """
        # n = 0 would make the normalizing constant zero, so require n >= 1.
        assert int(n) == n
        assert n >= 1 and alpha >= 1.0
        self.n = n
        self.alpha = alpha
        # Partial sums of 1 / i**alpha for i = 1..n, with a leading 0:
        tmp = [1. / (math.pow(float(i), alpha)) for i in range(1, n+1)]
        zeta = reduce(lambda sums, x: sums + [sums[-1] + x], tmp, [0])

        # Store the translation map: the partial sums divided by their total,
        # i.e. the cumulative distribution function of the Zipf pmf
        # (distMap[0] is 0 and distMap[-1] is 1).
        self.distMap = [x / zeta[-1] for x in zeta] 

    def next(self): 
        """Return an integer between 0 and n - 1 (a 0-based rank), with
        probability governed by the Zipf distribution specified by n and alpha.
        """
        # Take a uniform pseudo-random value in [0, 1):
        u = random.random()

        # Invert the CDF with a binary search to get the Zipf variable:
        return bisect.bisect(self.distMap, u) - 1
    
    def __get_alpha(self):
        ans = self.alpha
        return ans
    
    def __get_n(self):
        ans = self.n
        return ans
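
The class above samples by inverting the cumulative distribution: it draws a uniform number and finds the interval of the CDF that contains it. The following is a minimal standalone sketch of the same idea, not the notebook's own code. The helper names `zipf_cdf` and `sample_rank`, the seed, and the sample size are assumptions made for illustration.

```python
import bisect
import itertools
import random
from collections import Counter


def zipf_cdf(n, alpha):
    """Return [0, P(1), P(1) + P(2), ..., 1] for P(k) proportional to 1 / k**alpha."""
    weights = [1.0 / (k ** alpha) for k in range(1, n + 1)]
    total = sum(weights)
    cdf = [0.0] + [c / total for c in itertools.accumulate(weights)]
    cdf[-1] = 1.0  # guard against floating-point rounding in the last entry
    return cdf


def sample_rank(cdf, rng):
    """Draw a 0-based rank by locating a uniform draw inside the CDF."""
    return bisect.bisect(cdf, rng.random()) - 1


rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable
cdf = zipf_cdf(n=1000, alpha=1.00001)
counts = Counter(sample_rank(cdf, rng) for _ in range(50000))

# The three most probable ranks should appear in decreasing order of frequency.
top = [counts[r] for r in range(3)]
print(top)
```

With alpha close to 1, rank 0 should be drawn roughly twice as often as rank 1 and three times as often as rank 2, which is the rank-frequency pattern the fake text is meant to show.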



In [3]:
def read_file_generate_fake(char_num = 2, out_file =  'fake1.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 15000, 
                            alpha = 1.00001, noun = False):
    """Read the "roc2.txt" file and generate a fake script that satisfies Zipf's law.
    All the words in the output script share the same length, char_num.
    """
    SAMPLE_WORD_NUM = sample_word_num
    ALPHA = alpha
    NUM_WORD_IN_NOV = num_word_in_fake_scrip
    OUTPUT_FILE_NAME = out_file
    NOUN = noun
    CHAR_NUM = char_num
    
    zipf_gen =  ZipfGenerator(SAMPLE_WORD_NUM,ALPHA)
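    world_list = []

    # Each line is tab-separated: field 3 holds the word, and field 4 is
    # checked for an 'N' tag when only nouns are wanted.
    with open("roc2.txt", "r") as f:
        for line in f:
            line_split = line.split("\t")
            if NOUN:
                if 'N' in line_split[4]:
                    world_list.append(line_split[3])
            else:
                #if len(line_split[3]) == CHAR_NUM:
                world_list.append(line_split[3])

    # Drop blank entries. Build a new list, because removing items from a
    # list while iterating over it skips elements.
    world_list = [item for item in world_list if item != " "]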
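    #######################################
    ### This block is optional: it splits the words into single characters,
    ### shuffles them, and regroups them into pseudo-words of char_num characters.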
    
    tmp_list = []
    for item in world_list:
        for e in list(item):
            tmp_list.append(e)
    random.shuffle(tmp_list)
    list_2 = []
    tmp = ''
    for e in tmp_list:
        tmp = tmp + e
        if len(tmp) == char_num:
            list_2.append(tmp)
            tmp = ''
    
    world_list = list_2

    print("words in a corpus: " ,len(world_list))
    
    
    #######################################


    print("A corpus is successfully loaded.")
    
    random.shuffle(world_list)
    small_world_list = world_list[-SAMPLE_WORD_NUM:]
    target_string_list = []

    for i in range(NUM_WORD_IN_NOV):
        num = zipf_gen.next()
        w = small_world_list[num]
        target_string_list.append(w+" ")
        
    # Write the words separated by spaces, ending the line after every 21st word.
    with open(OUTPUT_FILE_NAME, 'w') as f2:
        word_count = 0
        for item in target_string_list:
            if word_count < 20:
                f2.write(item)
                word_count += 1
            else:
                word_count = 0
                f2.write(item + "\n")
    print("A fake script is successfully created !")
    print("--------------------")
    return None
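
The optional block in the function flattens every word into single characters, shuffles them, and cuts the shuffled stream into chunks of `char_num` characters. Any leftover characters that do not fill a whole chunk are discarded. Below is a small sketch of that step in isolation. The function name `regroup_characters` and the five sample words are hypothetical stand-ins, not content from roc2.txt.

```python
import random


def regroup_characters(words, char_num, rng):
    """Flatten words into characters, shuffle them, and cut fixed-length chunks."""
    chars = [ch for word in words for ch in word]
    rng.shuffle(chars)
    # Characters that cannot fill a whole chunk are dropped.
    usable = len(chars) - len(chars) % char_num
    return ["".join(chars[i:i + char_num]) for i in range(0, usable, char_num)]


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
words = ["中文", "詞庫", "隨機", "文本", "字"]  # hypothetical sample, 9 characters in total
pseudo_words = regroup_characters(words, char_num=2, rng=rng)
print(len(pseudo_words), pseudo_words)
```

Because the number of chunks is the total character count divided by `char_num` and rounded down, the "words in a corpus" count printed by the function shrinks as `char_num` grows.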

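## Change the parameters to generate the text you want
#### Parameter descriptions:
* char_num: number of characters in each word of the fake text
* out_file: file name of the output fake text
* sample_word_num: first parameter of Zipf's law (the number of distinct words to sample from)
* num_word_in_fake_scrip: total number of words in the output fake text
* alpha: second parameter of Zipf's law (the exponent)
* noun: whether to select only the nouns in roc2.txt (the Academia Sinica lexicon)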

In [12]:
read_file_generate_fake(char_num = 1, out_file =  'DEC16FAKE1_2.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 20000, 
                            alpha = 1.00001, noun = False)

A Zipfs' generator is initiated.
words in a corpus:  64878
A corpus is successfully loaded.
A fake script is successfully created !
--------------------


In [13]:
read_file_generate_fake(char_num = 4, out_file =  'DEC16FAKE4_2.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 20000, 
                            alpha = 1.00001, noun = False)

A Zipfs' generator is initiated.
words in a corpus:  16219
A corpus is successfully loaded.
A fake script is successfully created !
--------------------


In [14]:
read_file_generate_fake(char_num = 2, out_file =  'DEC16FAKE2_2.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 20000, 
                            alpha = 1.00001, noun = False)

A Zipfs' generator is initiated.
words in a corpus:  32439
A corpus is successfully loaded.
A fake script is successfully created !
--------------------


In [15]:
read_file_generate_fake(char_num = 3, out_file =  'DEC16FAKE3_2.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 20000, 
                            alpha = 1.00001, noun = False)

A Zipfs' generator is initiated.
words in a corpus:  21626
A corpus is successfully loaded.
A fake script is successfully created !
--------------------


In [16]:
read_file_generate_fake(char_num = 5, out_file =  'DEC16FAKE5_2.txt', sample_word_num = 8000,
                            num_word_in_fake_scrip = 20000, 
                            alpha = 1.00001, noun = False)

A Zipfs' generator is initiated.
words in a corpus:  12975
A corpus is successfully loaded.
A fake script is successfully created !
--------------------
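
One way to check whether a generated text follows Zipf's law is to fit a straight line to log frequency against log rank and compare the slope with -alpha. The sketch below does this with a plain least-squares fit. The helper `rank_frequency_slope` and the synthetic token list are assumptions for illustration; for a real check, the tokens would come from splitting one of the output files on whitespace.

```python
import math
from collections import Counter


def rank_frequency_slope(tokens, top_k=100):
    """Least-squares slope of log(frequency) against log(rank) for the top_k words."""
    freqs = sorted(Counter(tokens).values(), reverse=True)[:top_k]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data: word k appears about 1000 / k times, a Zipf profile with alpha = 1.
tokens = []
for k in range(1, 51):
    tokens.extend(["w%d" % k] * round(1000 / k))

slope = rank_frequency_slope(tokens, top_k=50)
print(slope)
```

For a sampled text the fit is noisier in the tail, where many words occur only once or twice, so restricting the fit to the most frequent words through `top_k` is a reasonable starting point.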
