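Word frequency counting is implemented here in two ways:
 - 1. A hand-written version that combines pandas with `Counter`
 - 2. A version built on `FreqDist` from the nltk library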

# Method 1

### Function implementation

In [1]:
import pandas
from collections import Counter

def getWordFreq(inStrList, stopwords=None):
    '''Return a DataFrame of word frequencies for a list of str.'''
    stopwords = [] if stopwords is None else stopwords
    # Lowercase so that "sylvia" and "SYLVIA" count as the same word
    inStrList = [i.lower() for i in inStrList]
    # Build the DataFrame from the Counter tallies
    df = pandas.DataFrame.from_dict(Counter(inStrList), orient='index').reset_index()
    # Name the columns and sort by count, descending
    df = df.rename(columns={'index': 'words', 0: 'count'}).sort_values(["count"], ascending=False)
    # Drop stop words
    df = df[~df.words.isin(stopwords)]
    return df

### Testing the function

In [2]:
# Load the English stop words (one per line)
with open('stopwords-en.txt', 'r', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f]
# Load the English test file (one word per line)
with open('test-en.txt', 'r', encoding='utf-8') as f:
    testStrList = [line.strip() for line in f]
print(testStrList)

['sylvia', 'SYLvia', 'SYLVIA', 'you', 'are', 'my', 'sweet', 'heart', 'sweet', 'sweet', 'heart', 'LOVE', 'you', 'so', 'much']
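
Both files are read the same way, so the loading step could be folded into one helper. The following is only a minimal sketch, assuming one token per line as in `test-en.txt`. The name `load_word_list` and the temporary demo file are hypothetical and not part of the original notebook. Unlike the cell above, this version also skips blank lines.

```python
import os
import tempfile

def load_word_list(path, encoding="utf-8"):
    """Read a one-token-per-line file, stripping whitespace and skipping blank lines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Demo on a throwaway file (hypothetical contents, not the original test-en.txt)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("sylvia\nSYLvia\n\nsweet\n")
    demo_path = tmp.name

demo_words = load_word_list(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(demo_words)
```

With such a helper, the two reads would become `stopwords = load_word_list('stopwords-en.txt')` and `testStrList = load_word_list('test-en.txt')`.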


In [3]:
getWordFreq(testStrList, stopwords)  # with stop-word filtering

index,words,count
1,sylvia,3
7,sweet,3
3,heart,2
2,love,1


In [4]:
getWordFreq(testStrList)  # without stop-word filtering

index,words,count
1,sylvia,3
7,sweet,3
0,you,2
3,heart,2
2,love,1
4,my,1
5,much,1
6,are,1
8,so,1


# Method 2

In [5]:
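from nltk import FreqDist

def getWordFreq2(inStrList, stopwords=None):
    '''Return the 10 most frequent words for a list of str.'''
    stopwords = [] if stopwords is None else stopwords
    # Lowercase first, then filter, so that capitalized stop words are removed too
    inStrList = [i.lower() for i in inStrList]
    inStrList = [i for i in inStrList if i not in stopwords]
    fdist = FreqDist(inStrList)
    return fdist.most_common(10)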

In [6]:
getWordFreq2(testStrList, stopwords)  # with stop-word filtering

[('sylvia', 3), ('sweet', 3), ('heart', 2), ('love', 1)]

In [7]:
getWordFreq2(testStrList)  # without stop-word filtering

[('sylvia', 3),
 ('sweet', 3),
 ('you', 2),
 ('heart', 2),
 ('love', 1),
 ('my', 1),
 ('much', 1),
 ('are', 1),
 ('so', 1)]
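
`FreqDist` is a subclass of `collections.Counter`, so the same tallies can be reproduced with the standard library alone. The sketch below is not nltk's implementation. `word_freq_counter` is a hypothetical name, the sample list is copied from the printed test data, and `assumed_stopwords` is only a guess at the relevant entries of `stopwords-en.txt`, inferred from which words disappear in the filtered output.

```python
from collections import Counter

def word_freq_counter(words, stopwords=(), top=10):
    """Count lowercased words, skip stop words, return the `top` most common."""
    stop = {s.lower() for s in stopwords}   # a set gives fast membership tests
    lowered = (w.lower() for w in words)    # normalize case before filtering
    counts = Counter(w for w in lowered if w not in stop)
    return counts.most_common(top)

sample = ['sylvia', 'SYLvia', 'SYLVIA', 'you', 'are', 'my', 'sweet', 'heart',
          'sweet', 'sweet', 'heart', 'LOVE', 'you', 'so', 'much']
assumed_stopwords = ['you', 'are', 'my', 'so', 'much']  # assumption, see above

result = word_freq_counter(sample, assumed_stopwords)
print(result)
```

Lowercasing before the stop-word check matters here: a token such as `'You'` would otherwise slip past a lowercase stop-word list.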

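Aside: a small experiment. For English text we sometimes want the frequency of phrases (n-grams) rather than of single words, which the following function provides.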

In [8]:
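from nltk import FreqDist, ngrams

def getWordFreq3(inStrList, n=1, stopwords=None):
    '''Return the 10 most frequent n-word phrases for a list of str.'''
    stopwords = [] if stopwords is None else stopwords
    # Lowercase first, then filter, so that capitalized stop words are removed too
    inStrList = [i.lower() for i in inStrList]
    inStrList = [i for i in inStrList if i not in stopwords]
    fdist = FreqDist(ngrams(inStrList, n))
    return fdist.most_common(10)

# Note: removing stop words first changes which words end up adjacent,
# e.g. ('sylvia', 'sweet') below spans the removed "you are my". To be looked into further.
getWordFreq3(testStrList, n=2, stopwords=stopwords)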

[(('sweet', 'heart'), 2),
 (('sylvia', 'sylvia'), 2),
 (('sylvia', 'sweet'), 1),
 (('heart', 'love'), 1),
 (('sweet', 'sweet'), 1),
 (('heart', 'sweet'), 1)]
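
On the open question above: filtering stop words before building n-grams joins words that were not neighbors in the original text. One possible alternative, sketched here under assumptions, is to build the n-grams first and then discard any n-gram that contains a stop word. The helpers `ngrams_py` and `phrase_freq` are hypothetical names, the `zip`-based n-gram builder is a plain-Python stand-in for `nltk.ngrams`, and `assumed_stopwords` is again inferred from the outputs rather than read from `stopwords-en.txt`.

```python
from collections import Counter

def ngrams_py(tokens, n):
    """Yield consecutive n-token tuples (plain-Python stand-in for nltk.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

def phrase_freq(words, stopwords=(), n=2, top=10):
    """Build n-grams on the unfiltered text, then drop those containing a stop word."""
    stop = {s.lower() for s in stopwords}
    tokens = [w.lower() for w in words]
    grams = (g for g in ngrams_py(tokens, n) if not any(t in stop for t in g))
    return Counter(grams).most_common(top)

sample = ['sylvia', 'SYLvia', 'SYLVIA', 'you', 'are', 'my', 'sweet', 'heart',
          'sweet', 'sweet', 'heart', 'LOVE', 'you', 'so', 'much']
assumed_stopwords = ['you', 'are', 'my', 'so', 'much']  # assumption, see above

bigram_counts = dict(phrase_freq(sample, assumed_stopwords, n=2))
print(bigram_counts)
```

With this ordering, `('sylvia', 'sweet')` no longer appears, because those two words were separated by stop words in the input. Whether that is preferable depends on the task, since some phrases of interest do contain stop words.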