In the previous notebook we built a dictionary of character sequences. Something we did not point out, but which was visible in the data, is that some sequences appeared in several upper/lower-case variants, for example `the`/`The`. Here we try to find out how often this happens and whether it deserves special handling.

In [1]:
def articles():
    with open('page_revisions_text', 'rb') as text_file:
        pending_article_data = b''
        while True:
            data = text_file.read(1024 * 1024)
            if len(data) == 0:
                break

            articles = data.split(b'\0')
            articles[0] = pending_article_data + articles[0]
            for index, article in enumerate(articles):
                if index + 1 == len(articles):
                    pending_article_data = article
                else:
                    yield article

        if len(pending_article_data) != 0:
            yield pending_article_data

In [2]:
import re

# Raw string, so that \w reaches the regex engine without an invalid-escape warning.
word_matcher = re.compile(r'\w+', re.U)

In [17]:
import collections

words = collections.defaultdict(lambda: 0)

for article in articles():
    for match in word_matcher.finditer(article.decode('utf-8')):
        words[match.group()] += 1

Let's look at some statistics. How many words do we have, and how long is the longest one...

In [18]:
len(words)

1990305

In [32]:
max(len(word) for word in words.keys())

880

This is surprising. Let's take a closer look at the longest words...

In [36]:
words_by_length = sorted(words.keys(), key=len, reverse=True)

In [37]:
words_by_length

['40236_rQ3D1Q26title1Q3DQ26title2Q3DRANQ2520Q2528MOVIEQ2529Q26reviewerQ3DVincentQ2520CanbyQ26pdateQ3D19860622Q26v_idQ3D40236_rQ513D1Q5126title1Q513DQ5126title2Q513DRANQ512520Q512528MOVIEQ512529Q5126reviewerQ513DVincentQ512520CanbyQ5126pdateQ513D19860622Q5126v_idQ513D40236_rQ51513D1Q515126title1Q51513DQ515126title2Q51513DRANQ51512520Q51512528MOVIEQ51512529Q515126reviewerQ51513DVincentQ51512520CanbyQ515126pdateQ51513D19860622Q515126v_idQ51513D40236_rQ5151513D1Q51515126title1Q5151513DQ51515126title2Q5151513DRANQ5151512520Q5151512528MOVIEQ5151512529Q51515126reviewerQ5151513DVincentQ5151512520CanbyQ51515126pdateQ5151513D19860622Q51515126v_idQ5151513D40236_rQ515151513D1Q5151515126title1Q515151513DQ5151515126title2Q515151513DRANQ515151512520Q515151512528MOVIEQ515151512529Q5151515126reviewerQ515151513DVincentQ515151512520CanbyQ5151515126pdateQ515151513D19860622Q5151515126v_idQ',
 '11111111111111111111111111111111111111111111111111111112222222222222222222222222222222222222222222222222222222222

There are some strange things here. Several are obvious instances of Wikipedia vandalism, and several are the work of chimpanzees with typewriters and (infinitely) much time. There are also quite a few numbers in binary, decimal, hexadecimal and base64 representation. The use of the underscore as a separator also stands out. Because `\w` matches the underscore too, our regular expression does not handle this well.
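
The underscore problem is easy to reproduce on a toy string. The sketch below uses a made-up sample (an assumption, not taken from the data) and a separate `demo_matcher` name to show that `\w+` treats an underscore-joined token as a single word:

```python
import re

# Same pattern as word_matcher, under a different name so that the
# notebook's own variable is left untouched.
demo_matcher = re.compile(r'\w+')

# Hypothetical sample text, not taken from the Wikipedia dump.
sample = 'see Main_Page and some_long_identifier here'

tokens = demo_matcher.findall(sample)
print(tokens)
```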

We will have to live with most of these oddities. But let's at least handle underscores correctly...

In [48]:
words = collections.defaultdict(lambda: 0)

for article in articles():
    for match in word_matcher.finditer(article.decode('utf-8')):
        for word in match.group().strip('_').split('_'):
            words[word] += 1

In [49]:
len(words)

1889077
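
One detail of `strip('_').split('_')` is worth keeping in mind: a token with doubled underscores inside, or one made only of underscores, still yields empty strings, so the empty string may end up counted as a word. A minimal sketch on hypothetical tokens (the helper name `split_on_underscores` is an assumption for illustration, not part of the notebook) that drops those empty pieces:

```python
def split_on_underscores(token):
    """Split a regex word token on underscores, discarding empty pieces."""
    return [piece for piece in token.split('_') if piece]

# The expression used in the cell above, for comparison.
stripped_then_split = 'a__b'.strip('_').split('_')
print(stripped_then_split)

# The filtered variant on a few hypothetical tokens.
print(split_on_underscores('a__b'))
print(split_on_underscores('___'))
print(split_on_underscores('_Main_Page_'))
```

Using this variant in the counting loop would change the totals shown in this notebook only through the entry for the empty string, if one exists.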

There are now about 100,000 fewer. Out of pure curiosity, let's look at the longest ones again...

In [50]:
words_by_length = sorted(words.keys(), key=len, reverse=True)

In [51]:
words_by_length

['11111111111111111111111111111111111111111111111111111112222222222222222222222222222222222222222222222222222222222222222222222222233333333333333333333333333333333333333333333333333333333333334444444444444444444444444444444444444444444444444444444444455555555555555555555555555555555555555555555555566666666666666666666666666677777777777777777777777777777777777777777777777777778888888888888888888888888888888888888888888888888888999999999999999999999999999999999999999999999990000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000',
 'moooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

Let's return to the original plan and look at the different upper/lower-case spellings of the words.

In [57]:
lowcase_words = collections.defaultdict(lambda: set())

for word in words:
    lowcase_words[word.lower()].add(word)

In [59]:
len(lowcase_words)

1637016

In [60]:
len(words) - len(lowcase_words)

252061

In [61]:
sorted(lowcase_words.values(), key=len, reverse=True)

[{'BWBwbwbwb',
  'BWbwBwbwb',
  'BWbwbwBwb',
  'BWbwbwbwB',
  'BwBWbwbwb',
  'BwBwbWbwb',
  'BwBwbwbWb',
  'BwbWBwbwb',
  'BwbWbwBwb',
  'BwbWbwbwB',
  'BwbwBWbwb',
  'BwbwBwbWb',
  'BwbwbWBwb',
  'BwbwbWbwB',
  'BwbwbwBWb',
  'BwbwbwbWB',
  'bWBwBwbwb',
  'bWBwbwBwb',
  'bWBwbwbwB',
  'bWbWbWbwb',
  'bWbWbwbWb',
  'bWbwBwbwB',
  'bWbwbWbWb',
  'bWbwbwBwB',
  'bwBWBwbwb',
  'bwBWbwBwb',
  'bwBWbwbwB',
  'bwBwBWbwb',
  'bwBwBwbWb',
  'bwBwbWBwb',
  'bwBwbWbwB',
  'bwBwbwBWb',
  'bwBwbwbWB',
  'bwbWBwBwb',
  'bwbWBwbwB',
  'bwbWbWbWb',
  'bwbWbwBwB',
  'bwbwBWBwb',
  'bwbwBWbwB',
  'bwbwBwBWb',
  'bwbwBwbWB',
  'bwbwbWBwB',
  'bwbwbwBWB'},
 {'REDIRECT',
  'REDIRECt',
  'REDIRect',
  'REDIrECT',
  'REDiRECT',
  'REDireCt',
  'REDirect',
  'REdirect',
  'ReDIRECT',
  'Redirect',
  'redIRect',
  'redirect'},
 {'CATEGORY',
  'CAtegory',
  'CategoRY',
  'CategorY',
  'Category',
  'cAtegory',
  'categorY',
  'category'},
 {'SEX', 'SEx', 'SeX', 'Sex', 'sEX', 'sEx', 'seX', 'sex'},
 {'THE', 'THe

Let's look at the differences between lowercase and uppercase letters more closely.

How many distinct characters do we have in total?

In [77]:
letters = set()

for word in words:
    letters |= set(word)

len(letters)

9709

In [78]:
lowcase_letters = set()

for word in lowcase_words:
    lowcase_letters |= set(word)

len(lowcase_letters)

9245

How many of the letters occur both as lowercase and as uppercase?

In [79]:
len(letters) - len(lowcase_letters)

464
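
The subtraction above is an indirect count: it works because lowercasing merges each upper/lower pair into a single character. A more direct sketch, assuming a small hypothetical character set `sample_letters` in place of `letters`, checks for each character whether its lowercase form differs and is also present:

```python
# Hypothetical stand-in for the `letters` set built above.
sample_letters = set('TheTHEcatÄäß5_')

# Characters whose lowercase form is different and also occurs in the set.
both_cases = {
    char for char in sample_letters
    if char.lower() != char and char.lower() in sample_letters
}
print(sorted(both_cases))

# Indirect count, as in the notebook: distinct characters before and
# after lowercasing.
sample_lowcase = {char.lower() for char in sample_letters}
difference = len(sample_letters) - len(sample_lowcase)
print(difference)
```

On this toy set the two counts agree. On real data they can differ slightly, for example for characters whose lowercase form is more than one code point.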

What fraction of the characters in the data are uppercase or lowercase letters?

In [86]:
lower_count = 0
upper_count = 0
all_count = 0

for article in articles():
    for char in article.decode('utf-8'):
        lower_count += char.islower()
        upper_count += char.isupper()
        all_count += 1

print('All:', all_count)
print()
print('Lower:', lower_count)
print('Lower ratio:', lower_count / all_count)
print()
print('Upper:', upper_count)
print('Upper ratio:', upper_count / all_count)

All: 885428308

Lower: 587611140
Lower ratio: 0.6636462090615698

Upper: 39756496
Upper ratio: 0.04490086395566201


In [90]:
non_ascii = set()

for article in articles():
    for char in article.decode('utf-8'):
        if ord(char) > 127:
            non_ascii.add(char)
        
print('Non ASCII characters:', len(non_ascii))
del non_ascii

Non ASCII characters: 10699


In [112]:
# Raw string, so that the backslash escapes reach the regex engine unchanged.
link_pattern = re.compile(r'\[\[([^\]\[]*(\[\[.*\]\])*?)\]\]', re.U)
links = collections.defaultdict(lambda: 0)

for article in articles():
    for link_match in link_pattern.finditer(article.decode('utf-8')):
        links[link_match.group(1)] += 1
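
To see what `link_pattern` captures, here is a small sketch on hypothetical wikitext (the sample string and the `demo_link_pattern` name are assumptions for illustration). Group 1 is everything between the double brackets, including any `|label` part and any `Namespace:` prefix:

```python
import re

# Same pattern as link_pattern, under a different name.
demo_link_pattern = re.compile(r'\[\[([^\]\[]*(\[\[.*\]\])*?)\]\]')

# Hypothetical wikitext, not taken from the dump.
sample = 'See [[Alan Turing]], [[Dune (novel)|Dune]] and [[Category:Books]].'

targets = [match.group(1) for match in demo_link_pattern.finditer(sample)]
print(targets)
```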

In [119]:
sum(len(key) * value for key, value in links.items())

172883568

In [121]:
sum(len(key) for key, value in links.items())

70026168

In [120]:
len(links)

2961604

In [136]:
s = set()
for key in links:
    if ':' in key:
        s.add(key.split(':')[0])
        
for x in s:
    print(x)


Maschera e volto dello Spiritualismo Contemporaneo
Islam and the Jews
Early Days
Molecular structure of Nucleic Acids|Molecular structure of Nucleic Acids
The revolution will not be televised
Tiny Toon Adventures
Shadowmen
Turn It Up!
Medea - Harlan's World|Medea
oldwikisource
Trinity and Beyond|Trinity and Beyond
1943
Trial of the Century
Love Undetectable
Three Imaginary Boys|Three Imaginary Boys
Common Cold (Codename
Queen
The Ring
Mythology (book)|Mythology
GUNNM
Star Trek IV
The Foot Book 
Jazz Masters
Pythagorean comma|no number of 3
Bush's Brain
Beat the Boots#As an Am|Beat the Boots I
Academy Award for Best Song|Academy Award for Best Song
Dune
Cardcaptor Sakura|Cardcaptor Sakura
Tical 2000
There and Now
[[bs
War on Terrorism
IAS 7
Job
I Ching hexagram 63|63. |
The Baseball Encyclopedia
Mr. Clemens and Mark Twain
This Time Around
British Summer Time|Summer
Greatest Hits Live
Montezuma's Revenge
Your Call Is Important To Us
Roman arithmetic#Addition|Roman arithmetic
The Legend 

Nations of Bosnia and Herzegovina|three constitutional nations
Protect Arizona Now|Protect Arizona Now
Fixed
Jerusalem
The George Tirebiter Story Chapter1
G3 Live
They Drew Fire
The Crow
Call of Cthulhu
H.P. Lovecraft’s Cthulhu
Young Hannibal
The Irish in America
Common-Civil-Calendar-and-Time|Common-Civil-Calendar-and-Time
#TOC placement
The Hedgehog, the Fox, and the Magister's Pox
Mission Impossible
BattleTech
ISO_20000|ISO/IEC 20000
ceb
finalfantasy
Masoukishin
The Pye History of British Pop Music
The Bible
Newton's_laws_of_motion#Newton's_Third_Law_
Scenes from a Memory|Metropolis Pt 2
Moneyball
The Selling of the President
Street Fighter III|Street Fighter III
Exiled
Doom II|Doom II
Angel
How to Practice
A User's Guide to the Millennium|A User's Guide to the Millennium
The Early Kings of Norway
Canada
NFL playoffs, 1993-94#AFC
Freight operators
Antiquities (Magic
Lizzie McGuire 3
Risk Junior
History of Australia since 1901| darkest period of its history
Starfleet Command (game)|S

Spider-Man and the X-Men
S Club 7
Animorphs#Animorphs_54
The Betrayal of America|The Betrayal of America
Hope Dies Last
James Dean
Okage
Final Fantasy III
Sak Pasé Presents
Marie Curie|Curie, Marie (maiden name
SYR1
Judge Dredd
Jamie Foxx Presents
Showdown
Diving into the Wreck
Outlaw Comic
Phantom of the Opera
Lipstick_Traces
Alan Turing
Over to You
CBS
Summer Solstice
Free Culture|Free Culture
The Adventures Of Jimmy Neutron
New Fun Comics|Fun
K-19
The Infinite Steve Vai
Lies My Teacher Told Me|Lies My Teacher Told Me
Formula 1
WikiFur
5150
Trantor
Streets
Taz
John 1
ISO 9000
Les Couloirs du temps
Patlabor
Gospel
Smash-Up
Get Funky With Me
m
Escalante
UFO's
Hyperspace (book)|Hyperspace
Zen
A.I. (film)|AI
Might and Magic V
La Sadies|La
Iconoclasm#The second iconoclastic period
Section Twenty-seven of the Canadian Charter of Rights and Freedoms|Section 27
Theatricals
The Chronicles of Riddick
Categories
Am I Blue?
UTC+5
Dodge City
Ecco the Dolphin
ecology#The_notion_of_biocenose
Read o

In [172]:
s = set()

for article in articles():
    for char in link_pattern.sub('', article.decode('utf-8')):
        s.add(char)

len(s)

8853

In [175]:
97 ** 2

9409