# Python tricks and pitfalls

This notebook collects useful tricks and warnings that may come in handy during our course.

## Magic functions

List the files in the notebook's directory.

In [1]:
ls

names.txt            stopwords.txt            war-and-peace.txt
sherlock-holmes.txt  tips-and-pitfalls.ipynb


Take a look at the text we need.

In [2]:
!head -n 19 war-and-peace.txt

﻿
The Project Gutenberg EBook of War and Peace, by Leo Tolstoy

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org


Title: War and Peace

Author: Leo Tolstoy

Translators: Louise and Aylmer Maude

Posting Date: January 10, 2009 [EBook #2600]

Last Updated: December 17, 2016



Measure the time required to read the file and convert the text to lowercase.

In [3]:
%%time

with open("war-and-peace.txt", "r") as f:
    text = f.read().lower()

CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 32.9 ms
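The `%%time` magic is only available inside IPython/Jupyter. A minimal sketch of the same measurement in a plain script, using the standard-library `time.perf_counter`; the helper name `timed_read_lower` and the temporary sample file are assumptions made for illustration and are not part of the notebook.

```python
import os
import tempfile
import time


def timed_read_lower(path):
    """Read a text file, lowercase it, and return (text, elapsed seconds)."""
    start = time.perf_counter()
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().lower()
    elapsed = time.perf_counter() - start
    return text, elapsed


# Hypothetical sample file so the sketch runs without war-and-peace.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("War and Peace\n" * 1000)
    sample_path = tmp.name

sample_text, seconds = timed_read_lower(sample_path)
os.remove(sample_path)
print(f"Read {len(sample_text)} characters in {seconds * 1000:.3f} ms")
```

A single measurement like this is noisy; for short snippets, repeated runs give a more stable estimate.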


## Loops vs. list comprehensions

Let's remove punctuation from the text; this can be done with a loop or with a list comprehension. For small snippets, it is preferable to measure execution time with the `%%timeit` magic.

In [4]:
from string import punctuation

print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
%%timeit -n 1 -r 5

processed_text = []
for letter in text:
    if letter not in punctuation:
        processed_text.append(letter)

298 ms ± 3.99 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [6]:
%%timeit -n 1 -r 5

processed_text = [letter for letter in text if letter not in punctuation]

158 ms ± 2.63 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
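As a possible third option (not in the original notebook), `str.translate` with a deletion table removes punctuation without a Python-level loop and is usually faster than either variant above. The sketch below uses the standard-library `timeit` module instead of the magic; the sample string and the function names are assumptions for illustration.

```python
import timeit
from string import punctuation

# Small stand-in for the notebook's `text`; the real novel is much longer.
sample = "well, prince, so genoa and lucca are now just family estates! " * 2000


def strip_with_comprehension(s):
    return "".join([ch for ch in s if ch not in punctuation])


# str.maketrans with a third argument maps each listed character to None,
# so translate() deletes it.
DELETE_PUNCT = str.maketrans("", "", punctuation)


def strip_with_translate(s):
    return s.translate(DELETE_PUNCT)


same = strip_with_comprehension(sample) == strip_with_translate(sample)
print("Same result:", same)

for fn in (strip_with_comprehension, strip_with_translate):
    # Best of 5 single runs, in seconds.
    seconds = min(timeit.repeat(lambda: fn(sample), number=1, repeat=5))
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms")
```

Timings depend on the machine and Python version, so no specific numbers are claimed here.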


## Searching in list vs. searching in set 

Suppose we want to filter out of the text every word that occurs among its first 5000 words

In [7]:
processed_text = ''.join([letter for letter in text if letter not in punctuation])
words = processed_text.split()

words_to_filter = words[:5000]

In [8]:
%%time

res = [w for w in words[5000:100000] if w not in words_to_filter]

CPU times: user 2.97 s, sys: 0 ns, total: 2.97 s
Wall time: 2.96 s


This solution has two problems:

* The list may contain duplicate words

In [9]:
%%timeit -r 5 -n 1 

ls = list(set(words_to_filter))
res = [w for w in words[5000:100000] if w not in ls]

1.16 s ± 7.14 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


* Why are we searching in a list at all?

In [10]:
%%timeit -r 5 -n 1 

s = set(words_to_filter)
res = [w for w in words[5000:100000] if w not in s]

8.85 ms ± 2.74 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
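The gap comes from how membership is tested: `in` on a list compares against elements one by one, while `in` on a set uses a hash lookup that takes roughly constant time on average. A small sketch on synthetic data; the word lists and their sizes are assumptions chosen so the example runs quickly without the novel.

```python
import timeit

# Synthetic stand-ins for the notebook's variables (assumed sizes).
words_to_filter = [f"w{i}" for i in range(2000)]
candidates = [f"w{i}" for i in range(1000, 6000)]

as_list = list(words_to_filter)
as_set = set(words_to_filter)


def filter_with(container):
    """Keep the candidates that are not in the given container."""
    return [w for w in candidates if w not in container]


res_list = filter_with(as_list)
res_set = filter_with(as_set)
print("Same result:", res_list == res_set, "| kept:", len(res_set))

t_list = timeit.timeit(lambda: filter_with(as_list), number=1)
t_set = timeit.timeit(lambda: filter_with(as_set), number=1)
print(f"list: {t_list * 1000:.2f} ms, set: {t_set * 1000:.2f} ms")
```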


## Counting words in the text 

A commonly seen approach

In [11]:
vocab1 = {}
for word in words:
    if word not in vocab1:
        vocab1[word] = 1
    else:
        vocab1[word] += 1

A simpler variant

In [12]:
from collections import defaultdict

vocab2 = defaultdict(int)
for word in words:
    vocab2[word] += 1
    
vocab2 = dict(vocab2)

The simplest way

In [13]:
from collections import Counter

vocab3 = dict(Counter(words))

In [14]:
assert vocab1 == vocab2
assert vocab2 == vocab3

vocab = vocab1

sorted(vocab.items(), key=lambda x: x[1], reverse=True)[:20]

[('the', 34300),
 ('and', 21818),
 ('to', 16587),
 ('of', 14959),
 ('a', 10407),
 ('he', 9655),
 ('in', 8873),
 ('his', 7955),
 ('that', 7639),
 ('was', 7310),
 ('with', 5675),
 ('had', 5349),
 ('it', 4805),
 ('her', 4632),
 ('not', 4593),
 ('at', 4510),
 ('him', 4429),
 ('as', 3955),
 ('on', 3922),
 ('but', 3665)]
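When the counts are kept in a `Counter`, the top entries can also be obtained with its `most_common` method instead of sorting by hand. A minimal sketch on a tiny stand-in word list (an assumption, so that the example runs without the novel):

```python
from collections import Counter

# Tiny stand-in for the notebook's `words` list.
words = "the tree and the woods and the needles".split()

counts = Counter(words)
top2 = counts.most_common(2)
print(top2)  # [('the', 3), ('and', 2)]

# Same result as sorting the items by count in descending order.
assert top2 == sorted(counts.items(), key=lambda x: x[1], reverse=True)[:2]
```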

## Mutable objects pitfalls

Consider a corpus of two sentences and a vocabulary of four words plus an `UNK` token for unknown words

In [15]:
corpus = ["little tree was in the woods".split(), "tree had no leaves it had needles".split()]
vocab = {"little": 0, "tree": 1, "woods": 2, "needles": 3, "UNK": 4}

We can build the BoW matrix by hand, starting with the word "woods" for the first document

In [16]:
def print_bow_matrix(bow, descr=''):
    print(descr)
    for row in bow:
        print(' '.join(str(x) for x in row))
    print('\n')


bag_of_words = [[0] * len(vocab)] * len(corpus)

print_bow_matrix(bag_of_words, "Initially empty matrix")
DOC_IDX = 0
bag_of_words[DOC_IDX][vocab["woods"]] = 1
print_bow_matrix(bag_of_words, "After modifying the first row")

Initially empty matrix
0 0 0 0 0
0 0 0 0 0


After modifying the first row
0 0 1 0 0
0 0 1 0 0




It seems we also added "woods" to the row of the second sentence: multiplying the outer list repeats a reference to the same inner list, so both rows are one object. Be careful when working with lists.

In [17]:
bag_of_words = [[0 for _ in range(4)] for __ in range(2)]

print_bow_matrix(bag_of_words, "Initially empty matrix")
DOC_IDX = 0
bag_of_words[DOC_IDX][vocab["woods"]] = 1
print_bow_matrix(bag_of_words, "After modifying the first row")

Initially empty matrix
0 0 0 0
0 0 0 0


After modifying the first row
0 0 1 0
0 0 0 0
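The two constructions can be told apart with the identity operator `is`. A short sketch; the variable names `aliased` and `independent` are illustrative and not part of the notebook.

```python
n_rows, n_cols = 2, 4

# The outer multiplication repeats one inner list object n_rows times.
aliased = [[0] * n_cols] * n_rows
# The comprehension evaluates its body once per row, creating a new list each time.
independent = [[0 for _ in range(n_cols)] for _ in range(n_rows)]

print(aliased[0] is aliased[1])          # True: both rows are one object
print(independent[0] is independent[1])  # False: rows are distinct objects

aliased[0][2] = 1
independent[0][2] = 1
print(aliased)      # [[0, 0, 1, 0], [0, 0, 1, 0]]
print(independent)  # [[0, 0, 1, 0], [0, 0, 0, 0]]
```

Repeating the innermost list with `[0] * n_cols` is safe because integers are immutable; the problem only appears when the repeated element is itself mutable.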




### Copying mutable objects

Now suppose you have a similar corpus for which you want to reuse the existing matrix

In [18]:
new_corpus = ["little tree was in the forest".split(), "tree had no leaves it had needles".split()]

def modify_copy_bow_matrix(f):
    bag_of_words = [[0 for _ in range(4)] for __ in range(2)]
    bag_of_words[DOC_IDX][vocab["woods"]] = 1
    print_bow_matrix(bag_of_words, "Original matrix")
    bag_of_words_copy = f(bag_of_words)
    bag_of_words_copy[DOC_IDX][vocab["woods"]] = 0
    return bag_of_words, bag_of_words_copy
    

bag_of_words, bag_of_words_copy = modify_copy_bow_matrix(lambda x: x)
print_bow_matrix(bag_of_words_copy, "Modified copy")
print_bow_matrix(bag_of_words, "Original matrix after the copy was modified")

Original matrix
0 0 1 0
0 0 0 0


Modified copy
0 0 0 0
0 0 0 0


Original matrix after the copy was modified
0 0 0 0
0 0 0 0




Perhaps the `copy()` function will help

In [19]:
from copy import copy

bag_of_words, bag_of_words_copy = modify_copy_bow_matrix(lambda x: copy(x))
print_bow_matrix(bag_of_words_copy, "Modified copy")
print_bow_matrix(bag_of_words, "Original matrix")

Original matrix
0 0 1 0
0 0 0 0


Modified copy
0 0 0 0
0 0 0 0


Original matrix
0 0 0 0
0 0 0 0




In [20]:
from copy import deepcopy

bag_of_words, bag_of_words_copy = modify_copy_bow_matrix(lambda x: deepcopy(x))
print_bow_matrix(bag_of_words_copy, "Modified copy")
print_bow_matrix(bag_of_words, "Original matrix")

Original matrix
0 0 1 0
0 0 0 0


Modified copy
0 0 0 0
0 0 0 0


Original matrix
0 0 1 0
0 0 0 0
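The three outcomes above follow from which objects are shared. Plain assignment only adds a second name for the same list, `copy()` creates a new outer list that still holds the original rows, and `deepcopy()` duplicates the rows as well. A compact sketch that checks this with `is`; the variable names are illustrative.

```python
from copy import copy, deepcopy

matrix = [[0, 0, 1, 0], [0, 0, 0, 0]]

alias = matrix           # no copy at all: a second name for the same list
shallow = copy(matrix)   # new outer list, same inner row objects
deep = deepcopy(matrix)  # new outer list and new inner rows

print(alias is matrix)          # True
print(shallow is matrix)        # False
print(shallow[0] is matrix[0])  # True: rows are still shared
print(deep[0] is matrix[0])     # False

deep[0][2] = 0
print(matrix[0])  # [0, 0, 1, 0]: unaffected by the deep copy
shallow[0][2] = 0
print(matrix[0])  # [0, 0, 0, 0]: changed through the shared row
```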




## Floating point math pitfall: underflow

Suppose we know that $p(w_i|C) = \frac{1}{10^i}$ and we need to compute $\prod_i p(w_i|C)$

In [21]:
probabilities = [10 ** (- x) for x in range(1, 30)]

In [22]:
import math

def prob_calc(probs):
    prob = 1.0
    for prob_ in probs:
        prob *= prob_
    return prob

print("The joint probability is:", prob_calc(probabilities))

The joint probability is: 0.0


In [23]:
def log_prob_calc(probs):
    prob = 0.0
    for prob_ in probs:
        prob += math.log(prob_)
    return prob

print("The logarithm of joint probability is:", log_prob_calc(probabilities))

The logarithm of joint probability is: -1001.6245154524097
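As a sanity check on the value above: the exponents 1 through 29 sum to 435, so the exact product is $10^{-435}$, far too small for a double-precision float, while its natural logarithm $-435 \ln 10 \approx -1001.62$ is an ordinary number. The sketch below recomputes both; it assumes Python 3.8 or later for `math.prod`.

```python
import math
import sys

probabilities = [10 ** (-x) for x in range(1, 30)]

exponent_sum = sum(range(1, 30))  # 1 + 2 + ... + 29
expected_log = -exponent_sum * math.log(10)
log_joint = sum(math.log(p) for p in probabilities)

print(exponent_sum)  # 435
print(log_joint, expected_log)
print(math.isclose(log_joint, expected_log))  # True

# Smallest positive normal double; 1e-435 is many orders of magnitude
# below it, so the direct product underflows to zero.
print(sys.float_info.min)
print(math.prod(probabilities))  # 0.0
```

Working in log space keeps the quantity representable, and comparing log-probabilities preserves the ordering of the probabilities themselves.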
