# 06 | Python “Black Box”: Input and Output


## Input and Output Basics

In [1]:
name = input('your name:')
gender = input('you are a boy?(y/n)')

######## Input ##########

# your name:Jack
# you are a boy?(y/n)y

welcome_str = 'Welcome to the matrix {prefix} {name}.'
welcome_dic = {
    'prefix': 'Mr.' if gender == 'y' else 'Mrs.',
    'name': name
}

print('authorizing...')
print(welcome_str.format(**welcome_dic))

your name:Jack
you are a boy?(y/n)y
authorizing...
Welcome to the matrix Mr. Jack.
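
The same kind of greeting can also be built with an f-string (Python 3.6+), which some readers find easier to scan than `str.format` with an unpacked dictionary. The following is a minimal sketch; the helper name `build_welcome` is hypothetical and is introduced only for illustration.

```python
def build_welcome(name, gender):
    # Hypothetical helper: choose a title from the y/n answer
    prefix = 'Mr.' if gender == 'y' else 'Mrs.'
    # The f-string substitutes the local variables directly
    return f'Welcome to the matrix {prefix} {name}.'

print(build_welcome('Jack', 'y'))
```

Wrapping the logic in a function also makes it testable without typing into `input()`.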


In [2]:

a = input()
# 1
b = input()
# 2

print('a + b = {}'.format(a + b))
########## Output ##############
# a + b = 12
print('type of a is {}, type of b is {}'.format(type(a), type(b)))
########## Output ##############
# type of a is <class 'str'>, type of b is <class 'str'>
print('a + b = {}'.format(int(a) + int(b)))
########## Output ##############
# a + b = 3

1
2
a + b = 12
type of a is <class 'str'>, type of b is <class 'str'>
a + b = 3
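
Because `input()` always returns a string, converting with `int()` raises `ValueError` when the text is not a valid integer. A minimal sketch of defensive conversion follows; `to_int` is a hypothetical helper, not something the cell above defines.

```python
def to_int(raw, default=None):
    """Convert a string (for example from input()) to int, or return default."""
    try:
        return int(raw)
    except ValueError:
        # int() rejects text such as 'abc' or '1.5'
        return default

print(to_int('1') + to_int('2'))   # 3
print(to_int('abc'))               # None
```

In an interactive program, the `default` value could be used to decide whether to prompt the user again.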



I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

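The text above is the content of `in.txt`. The word-count task has four steps:

1. Read the file;
2. Remove all punctuation and newlines, and convert all uppercase letters to lowercase;
3. Merge identical words, count how often each word occurs, and sort by frequency in descending order;
4. Write the results line by line to the file out.txt.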

In [4]:
import re

def parse(text):
    
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Build the list of all words
    word_list = text.split(' ')
    
    # Drop empty strings
    word_list = filter(None, word_list)
    
    # Build a dictionary mapping each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
        
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_word_cnt

with open('in.txt', 'r') as fin:
    text = fin.read()
    
word_and_freq = parse(text)

with open('out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
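
For comparison, the counting and sorting steps can also be expressed with `collections.Counter` from the standard library. This is a sketch of an alternative, not the approach used in the cell above; the function name `parse_with_counter` is hypothetical.

```python
import re
from collections import Counter

def parse_with_counter(text):
    # Replace anything that is not a word character or a space, then lowercase
    text = re.sub(r'[^\w ]', ' ', text).lower()
    # split() with no argument drops empty strings, so no filter step is needed
    counts = Counter(text.split())
    # most_common() returns (word, count) pairs, highest count first
    return counts.most_common()

print(parse_with_counter('Free at last! Free at last!'))
```

The return value has the same shape as the list of `(word, freq)` pairs written to `out.txt`, so the file-writing loop would not need to change.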

In [2]:
import json

params = {
    'symbol': '123456',
    'type': 'limit',
    'price': 123.4,
    'amount': 23
}

params_str = json.dumps(params)

print('after json serialization')
print('type of params_str = {}, params_str = {}'.format(type(params_str), params_str))

original_params = json.loads(params_str)

print('after json deserialization')
print('type of original_params = {}, original_params = {}'.format(type(original_params), original_params))

after json serialization
type of params_str = <class 'str'>, params_str = {"symbol": "123456", "type": "limit", "price": 123.4, "amount": 23}
after json deserialization
type of original_params = <class 'dict'>, original_params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}
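
`json.dumps` and `json.loads` work on strings; the standard library also provides `json.dump` and `json.load`, which write to and read from file objects directly. A minimal sketch, assuming a file named `params.json` inside a temporary directory (both chosen here purely for illustration):

```python
import json
import os
import tempfile

params = {'symbol': '123456', 'type': 'limit', 'price': 123.4, 'amount': 23}

# Write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'params.json')

    with open(path, 'w') as fout:
        json.dump(params, fout)      # serialize straight to the file

    with open(path, 'r') as fin:
        restored = json.load(fin)    # deserialize straight from the file

print(restored == params)
```

This avoids holding the intermediate JSON string in a variable when the goal is simply to persist a dictionary.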


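## Exercise

1. Can you reimplement the word count from the NLP example? This time, in.txt may be extremely large (so you cannot read it into memory all at once), while out.txt will not be large (meaning many words are repeated). Hint: you may need to read a fixed-length string each time, process it, and then read the next one. Splitting purely by length can cut a word in two, however, so this boundary case needs careful handling.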

In [None]:
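import re

CHUNK_SIZE = 100  # number of characters read per chunk

def parse_word(text, last_word, word_cnt):
    
    # Prepend the possibly incomplete word carried over from the previous chunk,
    # then use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', last_word + text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Build the list of words in this chunk
    cur_word_list = text.split(' ')
    
    # The last item may have been cut off by the chunk boundary, so carry it over
    # (it is an empty string when the chunk ends on a separator)
    last_word = cur_word_list.pop()
    
    # Count the complete words, skipping empty strings
    for word in filter(None, cur_word_list):
        word_cnt[word] = word_cnt.get(word, 0) + 1
    
    return last_word
    
def solver():
    # Only the word counts are kept in memory, never the whole text
    word_cnt, last_word = {}, ''
    with open('in.txt', 'r') as fin:
        while True:
            text = fin.read(CHUNK_SIZE)
            if not text:
                break  # end of file reached, leave the loop
            last_word = parse_word(text, last_word, word_cnt)
    
    # Count the final carried-over word, if there is one
    if last_word:
        word_cnt[last_word] = word_cnt.get(last_word, 0) + 1
        
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda x: x[1], reverse=True)
    
    with open('out.txt', 'w') as fout:
        for word, freq in sorted_word_cnt:
            fout.write('{} {}\n'.format(word, freq))
    
    return sorted_word_cnt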
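
If the input is known to be broken into lines of reasonable length, iterating over the file object line by line is another way to keep memory use low. It also sidesteps the chunk-boundary problem mentioned in the hint, because newlines are already treated as word separators. The sketch below relies on that assumption; `count_words_by_line` is a hypothetical name, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import io
import re
from collections import Counter

def count_words_by_line(stream):
    counts = Counter()
    # Iterating over a text stream yields one line at a time
    for line in stream:
        cleaned = re.sub(r'[^\w ]', ' ', line).lower()
        counts.update(cleaned.split())
    # (word, count) pairs, highest count first
    return counts.most_common()

# io.StringIO stands in for open('in.txt') in this self-contained sketch
sample = io.StringIO('Free at last!\nFree at last!\n')
print(count_words_by_line(sample))
```

This does not help when a single line is itself too large for memory; in that case fixed-size chunks with explicit boundary handling are still needed.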
