# Deep Learning for Text Data, HSE Faculty of Computer Science

## Homework 1: Text Suggestion

### Grading and Penalties

The maximum grade for this assignment is 10 points. Submissions are not accepted after the hard deadline. For submissions after the soft deadline, one point is deducted for each day of delay.

The assignment must be completed independently. "Similar" solutions are treated as plagiarism, and all students involved (including those who were copied from) receive no more than 0 points for it. All code must be written by yourself. Using someone else's code is prohibited, even with a link to the source. Within reason, of course: borrowing a couple of obvious lines to implement a small piece of functionality is fine.

An inefficient implementation may lower your grade. The grade may also be reduced for poorly readable code. Every answer must be accompanied by code or by comments explaining how it was obtained.

__Soft deadline: Nov 24__

__Hard deadline: Nov 27__


### About the Assignment

In this assignment you will implement a system that suggests, in real time, a good completion of the current word or of the next several words, like the ones used on phones, in search bars, or in email apps. You will wrap the resulting system in a user interface built with the [reflex](https://github.com/reflex-dev/reflex) library, both to make it convenient to use and to verify that everything works as intended. This time you will not have to train any models; we will limit ourselves to n-gram generation.
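To make "n-gram generation" concrete, here is a minimal sketch for the simplest case, bigrams. It only illustrates the idea and is not the required solution: the function names `build_bigram_counts` and `suggest_next` and the toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_bigram_counts(corpus):
    """Count how often each token follows each other token.

    corpus: list of tokenized texts (each text is a list of tokens).
    """
    counts = defaultdict(Counter)
    for tokens in corpus:
        for prev_token, next_token in zip(tokens, tokens[1:]):
            counts[prev_token][next_token] += 1
    return counts


def suggest_next(counts, word, k=3):
    """Return up to k tokens that most often follow `word` in the corpus."""
    followers = counts.get(word)
    if followers is None:
        return []
    return [token for token, _ in followers.most_common(k)]


# Toy corpus, hypothetical data for illustration only
toy_corpus = [
    ["please", "call", "me"],
    ["please", "call", "me"],
    ["please", "let", "me", "know"],
]
bigram_counts = build_bigram_counts(toy_corpus)
print(suggest_next(bigram_counts, "please", k=2))
```

Longer contexts (trigrams and beyond) follow the same pattern, with a tuple of previous tokens used as the key instead of a single token.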

### Structure

This homework consists of two parts that are expected to be roughly equal in difficulty. In the first part you complete 5 tasks that result in a minimal working solution. In the second part, building on what you have already done, you implement a full-fledged text suggestion system with a user interface. We do not restrict your imagination in the second part in any way: do whatever you like, as long as the result is a convenient framework. The better your result, the more points you get. If it turns out really well, we will add bonus points at our discretion.

### Grading
When submitting the assignment to anytask, you must submit all of your code along with a report describing in detail the techniques you used to build your system. It is also worth writing about what did not work and why.

The tasks part is worth up to __5__ points and the report up to __3__ points. The remaining 2 points are for follow-up questions, if any arise; if there are no questions, the 2 points are awarded automatically.

## Part 1

### Data

To collect text statistics, use the `emails.csv` dataset. You can find it at this [link](https://disk.yandex.ru/d/ikyUhWPlvfXxCg). It contains more than 500 thousand emails in English.

In [1]:
import pandas as pd

emails = pd.read_csv('emails.csv')
len(emails)

517401

In [2]:
msg = emails.sample().iloc[0]['message']
print(msg)

Message-ID: <3583016.1075847550228.JavaMail.evans@thyme>
Date: Fri, 2 Mar 2001 01:31:00 -0800 (PST)
From: tana.jones@enron.com
To: greg.whiting@enron.com, sarah.wesner@enron.com
Subject: (01-78) MARGIN RATE CHANGE FOR HENRY HUB NATURAL GAS FUTURES
 CONTRACTS
Cc: mary.cook@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: mary.cook@enron.com
X-From: Tana Jones
X-To: Greg Whiting, Sarah Wesner
X-cc: Mary Cook
X-bcc: 
X-Folder: \Tanya_Jones_June2001\Notes Folders\Sent
X-Origin: JONES-T
X-FileName: tjones.nsf

----- Forwarded by Tana Jones/HOU/ECT on 03/02/2001 09:31 AM -----

	exchangeinfo@nymex.com
	03/02/2001 08:42 AM
		 
		 To: tana.jones@enron.com
		 cc: 
		 Subject: (01-78) MARGIN RATE CHANGE FOR HENRY HUB NATURAL GAS FUTURES 
CONTRACTS


Notice No. 01-78
March 02, 2001
TO:
ALL NYMEX DIVISION MEMBERS AND MEMBER FIRMS
ALL NYMEX DIVISION CLEARING FIRMS
ALL NYMEX DIVISION OPERATIONS MANAGERS

FROM:
Neal Wolkoff
Executive Vice Pre

Note that the data is very dirty. Every email contains various metadata that will only get in the way when predicting a text continuation.

__Task 1 (1 point).__ Clean the text corpus as you see fit. Ideally, the processed texts should contain only the text of the email itself and nothing extra, such as links, recipients, and other symbols that we definitely do not want to suggest as a continuation. The grade will reflect how close your result is to this ideal.

In [4]:
import email
import re


def get_body(raw_message):
    """Return the decoded text/plain body of a raw email, or '' if there is none."""
    parsed = email.message_from_string(raw_message)
    payload = None

    if parsed.is_multipart():
        for part in parsed.walk():
            ctype = part.get_content_type()
            cdispo = str(part.get('Content-Disposition'))

            # take the first text/plain part that is not an attachment
            if ctype == 'text/plain' and 'attachment' not in cdispo:
                payload = part.get_payload(decode=True)
                break
    else:
        # not multipart: the payload is the body itself
        payload = parsed.get_payload(decode=True)

    # get_payload(decode=True) returns bytes, or None when nothing can be decoded
    if payload is None:
        return ""
    return payload.decode("utf-8", errors="replace")

def clean_email_text(message):
    body = get_body(message)

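    # Remove reply headers like 'Re:', 'Cc:', 'Fw:', etc.
    # Caveat: [^:]* also matches newlines, so a match can run across several
    # lines up to the next colon and drop body text; [^:\n]* would confine it to one line.
    reply_header_pattern = r"^(Re|Fw|Fwd|Cc|Bcc|Subject|To|From|Sent)[^:]*:.*(\n|\r\n)?"
    body = re.sub(reply_header_pattern, '', body, flags=re.MULTILINE | re.IGNORECASE)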

    # Remove email addresses, phone numbers, URLs, and filenames
    body = re.sub(r'http\S+|www\S+|https\S+', '', body)
    body = re.sub(r'\b\S+@\S+\b', '', body)
    body = re.sub(r'\b\d{10}\b|\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', '', body)
    # Removes filenames with extensions
    body = re.sub(r'\b\w+\.\w+\b', '', body)
    # Remove any special characters, numbers, punctuation but retain alphabetic characters and spaces
    clean_text = re.sub(r'[^a-zA-Z\s]', ' ', body)

    # Normalize whitespace
    clean_text = re.sub(r'\s+', ' ', clean_text).strip()

    return clean_text


In [113]:
msg = emails.sample().iloc[0]['message']

display(clean_email_text(msg))
print(msg)

'AMEREX Diana Sean I am missing the following deal Enron sells to Powerex MId C light at Chris Mallory Amerex does not recognize deal BLOOMBERG All deals are PREBON Chris Mallory deal Is this a duplicate of because Prebon only recognizes deal deal Prebon says should be heavy not light hrs'

Message-ID: <24659304.1075841742070.JavaMail.evans@thyme>
Date: Wed, 28 Mar 2001 08:32:00 -0800 (PST)
From: evelyn.metoyer@enron.com
To: kate.symes@enron.com
Subject: 3/38 Checkout
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Evelyn Metoyer
X-To: Kate Symes
X-cc: 
X-bcc: 
X-Folder: \kate symes 6-27-02\Notes Folders\Deal communication\Deal discrepancies
X-Origin: SYMES-K
X-FileName: kate symes 6-27-02.nsf

AMEREX

Diana/Sean
I am missing the following deal:
1) Enron sells to Powerex MId-C light 3/29 at $140.00


Chris Mallory
Amerex does not recognize deal 563362




BLOOMBERG
All deals are o.k.





PREBON

Chris Mallory
deal 563461
Is this a duplicate of 563460 because Prebon only recognizes 1 deal

deal 563358
Prebon says should be heavy not light hrs.
















In [5]:
emails['cleaned_message'] = emails['message'].apply(clean_email_text)

In [6]:
pd.set_option('display.max_colwidth', 10000)
emails.sample(5)

Unnamed: 0,file,message,cleaned_message
717,allen-p/all_documents/204.,"Message-ID: <9320715.1075855669953.JavaMail.evans@thyme>\nDate: Wed, 26 Jul 2000 03:56:00 -0700 (PDT)\nFrom: phillip.allen@enron.com\nTo: mike.grigsby@enron.com, keith.holst@enron.com, matthew.lenhart@enron.com, \n\tfrank.ermis@enron.com, jay.reitmeyer@enron.com\nSubject: New Generation Update 7/24/00\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Phillip K Allen\nX-To: Mike Grigsby, Keith Holst, Matthew Lenhart, Frank Ermis, Jay Reitmeyer\nX-cc: \nX-bcc: \nX-Folder: \Phillip_Allen_Dec2000\Notes Folders\All documents\nX-Origin: Allen-P\nX-FileName: pallen.nsf\n\n---------------------- Forwarded by Phillip K Allen/HOU/ECT on 07/26/2000 \n10:49 AM ---------------------------\n \n\tEnron North America Corp.\n\t\n\tFrom: Kristian J Lande 07/25/2000 02:24 PM\n\t\n\nTo: Christopher F Calger/PDX/ECT@ECT, Jake Thomas/HOU/ECT@ECT, Frank W \nVickers/HOU/ECT@ECT, Elliot Mainzer/PDX/ECT@ECT, Michael McDonald/SF/ECT@ECT, \nDavid Parquet/SF/ECT@ECT, Laird Dyer/SF/ECT@ECT, Jim Buerkle/PDX/ECT@ECT, Jim \nGilbert/PDX/ECT@ECT, Terry W Donovan/HOU/ECT@ECT, Jeff G \nSlaughter/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Tim Belden/HOU/ECT@ECT, Mike \nSwerzbin/HOU/ECT@ECT, Matt Motley/PDX/ECT@ECT, Robert Badeer/HOU/ECT@ECT, \nSean Crandall/PDX/ECT@ECT, Diana Scholtes/HOU/ECT@ECT, Tom \nAlonso/PDX/ECT@ECT, Mark Fischer/PDX/ECT@ECT, Tim Heizenrader/PDX/ECT@ECT\ncc: Phillip K Allen/HOU/ECT@ECT \nSubject: New Generation Update 7/24/00\n\n\n",Forwarded by Phillip K Allen HOU ECT on AM Enron North America Corp From Kristian J Lande PM Elliot Michael David Laird Jim Jim Terry W Jeff G Tim Mike Matt Robert Sean Diana Tom Mark Tim
87706,dean-c/inbox/90.,"Message-ID: <7429710.1075852138146.JavaMail.evans@thyme>\nDate: Mon, 29 Oct 2001 07:45:12 -0800 (PST)\nFrom: truorange@aol.com\nTo: truorange@aol.com\nSubject: True Orange, October 29, 2001, Part 2 of 3\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: TruOrange@aol.com@ENRON\nX-To: TruOrange@aol.com\nX-cc: \nX-bcc: \nX-Folder: \CDEAN (Non-Privileged)\Dean, Clint\Inbox\nX-Origin: DEAN-C\nX-FileName: CDEAN (Non-Privileged).pst\n\nPart 2\n\nRecruiting Roundup\n\nThe status of three outstanding players Texas is recruiting is rather murky\nat this time.\nLB Garnet Smith of Arlington Lamar is committed to Texas, but is setting up\nvisits to Oklahoma (he was once committed to the Sooners) and several other\nschools.\nWR Robert Timmons of Flower Mound Marcus committed to Florida and LB Michael\nWilliams of Lindale committed to Oklahoma last week, but then they both\ndecommitted.\nWilliams attended Texas A&M's game Saturday and plans to come to take an\nofficial visit to Texas the December 14 weekend.\n* * * *\nThe Longhorns' 16 commitments include nine players who are listed on one or\nmore national top 100 teams.\nThey are WR Marquis Johnson, 6-3, 200, 4.48, of Centennial High School in\nChampaign, Illinois; DE Chase Pittman, 6-5, 263, 4.7, of Shreveport Evangel;\nLB Garnet Smith, 6-3, 221, 4.54, of Arlington Lamar, DTs Sonny Davis, 6-1,\n320, 5.0, of Gulf Coast JC in Mississippi and formerly of Austin Lanier, Earl\nAnderson, 6-4, 270, 4.8, of San Marcos, Lyle Sendlein, 6-4, 260, 4.8, of\nScottsdale Chaparral, the two-time defending Class 4A champion in Arizona;\nOLs Brett Valdez, 6-4, 310, 5.1, of Brownwood and Neale Tweedie, 6-5, 260,\n4.9, of Allen; and TE David Thomas, 6-3, 210, 4.6, of Wolfforth Frenship.\nDavis made all of the top national lists last year. 
However, he injured a\nknee and has undergone surgery, and his coach at Gulf Coast JC says he has a\nlot of academic ground to cover to graduate.\nThe other Longhorn pledges are DT Tully Janszen, 6-4, 255, 4.78, of Keller;\nLBs Brian Robison, 6-3, 245, 4.6, of Splendora and Marcus Myers, 6-3, 220,\n4.5, of Pflugerville Connally; WR Dustin Miksch, 6-0, 165, 4.4, of Round\nRock Westwood; QB Billy Don Malone, 6-2 1/2, 185, 4.7, of Paris North\nLamar; DB Matt Melton, 5-11, 190, 4.43 of Tyler, and RB/Ath Clint Haney,\n5-11, 190, 4.27 of Smithson Valley.\n* * * *\nThe Longhorns will give about 10 to 12 more scholarships and I think most of\nthem will be awarded to members of a 20-player group that includes QB Vincent\nYoung of Houston Madison, RBs Albert Hardy of Galena Park and Selvin Young of\nHouston Jersey Village, WRs Timmons, Biren Ealy of Cypress Falls and Anthony\nWright of Klein Forest, TE Eric Winston of Midland Lee, OLs Justin Blalock of\nPlano East and Tony Ugoh of Spring Westfield, DEs Bryan Pickryl of Jenks,\nOkla., Larry Dibbles of Lancaster and Travis Leitko of The Woodlands, DTs\nRodrique Wright of Alief Hastings, Marco Martin of Mesquite and Kasey\nStuddard of Highlands Ranch, Colo., LBs Aaron Harris of North Mesquite and\nLindale's Williams, and DBs Edorian McCullough of North Garland, Brodney Pool\nof Houston Westbury and Ricky Wilson of Port Arthur Lincoln.\nIf the Longhorns finish strong, they have a very good chance to get the four\ntop players on my 25-man ""difference-maker"" list.\nThey already have No. 4 in Earl Anderson and they have very good shots at No.\n1 Wright, No. 2 Young and No. 3 Dibbles.\nIf Pikryl lived in Texas, he would be in my top 5, and he says Texas is the\nleader right now. He took his official visit the weekend of the Colorado game.\nThe Longhorns might take a kicker if they find one who is consistent at\nkicking off into the end zone. 
Trey DeCarlo at Carrollton Creekview is the\nbest I've seen this year at booting them deep. College kickers start five\nyards farther back but DeCarlo's kicks usually carry out of the end zone.\n* * * *\nINTERESTING RECRUITING TIDBIT: The Longhorns lost their mystique for many\nyears, but there are a few signs they are getting it back. Consider this item\nin last Friday's Chicago Tribune about UT pledge Marquis Johnson, a top 50\nnational recruit from Champaign Cen-tennial,under a brief column entitled ""8\nPlayers to watch:\n""Anytime Texas comes to Illinois to recruit a player, you know he's something\nspecial.""\n\nSubscribe To The E-Mail/Fax To Get Year-Round Football\n& Recruiting Scoops !\nSave Big As\nAn Internet Subscriber !\nWhether you live close to Austin or far away, the True Orange E-Mail/Fax\nService is the way to keep up with Longhorn football and Longhorn recruiting\n- instantly. It has about 110 to 120 timely e-mail/faxes a year, primarily\nabout football and football recruiting. To subscribe, send your check to\nTrue Orange, Box 26530, Austin, Texas 78755, and copy or clip the coupon\nbelow and fill in the blanks. If you want it mailed, or by E-Mail, just\ninclude the right numbers.\n\no I'm enclosing $99 for the 99-fax package for the next year\no I'm enclosing $130 (an $11 saving) to renew my subscription to True Orange\nand to subscribe to the 99 faxes.\no I'm enclosing $79 for the 99-fax package for the next year by E-Mail (a $20\nsaving)\no Here's $110 to renew my subscription to True Orange and to subscribe to\nthe 99 faxes by E-Mail (a $31 saving)\no Here's $99 to subscribe to True Orange via the Internet and to subscribe\nto the 99 faxes by E-Mail (a $42 saving)\n\nName\n\nFax No.\n(or E-Mail or mailing address)\n\nGame Quotes . .\n\n""We were able to spread the field today because (tight end) Bo Scaife made\nsome great catches and that opened things up. 
I think Bo gives us the ability\nto stretch the field, even when the ball doesn't come to him, because the\ndefense has to be aware of where he is at all times. (Quarterback) Chris\n(Simms) did a great job looking off on his big catch just before the half.""\n- Texas offensive coordinator Greg Davis\n* * * *\n""This was a great team win for us. The offense did a great job of hanging on\nthe ball and running the clock, and the defense did a good job of getting\nsome three-and-outs. That's the way it's supposed to work when you draw it\nup. I give a lot of credit to (CB) Quentin Jammer and (nickel back) Dakarai\nPearson for shutting down their big-play receivers. Jammer held (WR) Justin\nGage to two catches and Dak really did a good job on their tight end after we\ndecided to give him the primary coverage on him.""\n- UT defensive coordinator Carl Reese\n* * * *\n""I don't think I was running any harder in the fourth quarter. I think our\noffensive line just wore them down a little. Coach Brown and coach Davis told\nme I needed to take charge in the fourth quarter and I just tried to keep\nrunning and pounding as hard as I could. It feels good to be part of the\nrecord book.""\n- UT freshman RB Cedric Benson, talking about his game and the fact that he\nis the first true freshman to rush for 100 yards in three straight games at\nTexas\n* * * *\n""I was getting real clean releases. I used my speed to my advantage. I was\ngetting single coverage because they were worrying about our running game and\nour wide receivers. They were just sitting back waiting for me to come to\nthem.""\n- Longhorn TE Bo Scaife, who caught five passes for 73 yards, including ones\nfor 27 and 28 yards that set up first-half touchdowns\n* * * *\n""We won and Oklahoma lost, Virginia Tech lost and Stanford is ahead of UCLA\n(Stanford won). That's what I call a good day.""\n- UT quarterback Chris Simms\n* * * *\n""When I heard the Oklahoma score, I was a little happy. 
Then I heard the\nVirginia Tech score and I was a little happier, then the UCLA-Stanford game\ncame as quite a surprise. That's three teams in the top five of the BCS.""\n- Longhorn wide receiver Roy Williams\n* * * *\n""He's a really good cornerback. He covered me as well as anybody I've seen\nthis year.""\n- Missouri wide receiver Justin Gage, talking about UT cornerback Quentin\nJammer, who held him to a season low two catches for 14 yards\n\nTexas-Missouri Statistics\n\nScoring Summary\nTexas 0 14 7 14 - 35\nMissouri 0 10 0 6 - 16\n\nMU - Fredrickson 8 pass from Farmer (Hammerich kick) 11:12 2Q (11 plays, 76\nyds)\nUT - B. Johnson 5 pass from Simms (Mangum kick) 5:44 2Q (10 plays 64 yds)\nUT - Edwards 3 pass from Simms (Mangum kick) 1:50 2Q (4 plays 41 yds)\nMU - Hammerich 22 FG 0:07 2Q (10 plays, 67 yds)\nUT - Simms 1 run (Mangum kick) 8:50 3Q (13 plays 73 yds)\nUT - R. Williams 8 pass from Simms (Mangum kick) 10:27 4Q (11 plays 56 yds)\nMU - Abro 7 run (pass failed) 7:41 4Q (8 plays, 56 yds)\nUT - Robin 39 pass from Simms 4:27 4Q (5 plays, 60 yds)\n\n Official Attendance: 51,123\n\nTeam Statistics\n\n Texas Missouri\nFirst Downs 25 16\nRushing 15 8\nPassing 9 5\nPenalty 1 3\nRushing Attempts, Net Yards 46-192 26-147\nNet Yards Passing 229 97\nPasses Comp., Att., Int. 24-30-0 9-28-1\nTotal Plays, Offense 76-421 54-244\nAvg. Gain per Play 5.5 4.5\nFumbles Lost 0 of 0 0 of 1\nPenalties, Yards 8-64 5-31\nPunts, Avg. 3-29.3 5-44.2\nTime of Possession 39:22 20:38\nThird-Down Conversions 12 of 17 3 of 10\nFourth-Down Conversions 1 of 1 0 of 1\nSacks by Team, Yds Lost 1-6 1-9\n\nIndividual Statistics\nTexas\nRushing - Benson 31-157; Simms 9-27, 1 TD; Ike 1-7; B. J. Johnson 1-5;\nJeffery 1-4; Trahan 1-2; Team 1-minus 1; R. Williams 1-minus 9.\nPassing - Simms 24-30, 229 yds, 4 TD, 0 Int.\nReceiving - Scaife 5-73; Robin 3-72, 1 TD; R. Williams 5-34, 1 TD; B. 
J.\nJohnson 4-23, 1 TD; Jeffer...",Part years but there are a few signs they are getting it back Consider this item in last Friday s Chicago Tribune about UT pledge Marquis Johnson a top national recruit from Champaign Cen tennial under a brief column entitled Players to watch Anytime Texas comes to Illinois to recruit a player you know he s something special Subscribe To The E Mail Fax To Get Year Round Football Recruiting Scoops Save Big As An Internet Subscriber Whether you live close to Austin or far away the True Orange E Mail Fax Service is the way to keep up with Longhorn football and Longhorn recruiting instantly It has about to timely e mail faxes a year primarily about football and football recruiting To subscribe send your check to True Orange Box Austin Texas and copy or clip the coupon below and fill in the blanks If you want it mailed or by E Mail just include the right numbers o I m enclosing for the fax package for the next year o I m enclosing an saving to renew my subscription to True Orange and to subscribe to the faxes o I m enclosing for the fax package for the next year by E Mail a saving o Here s to renew my subscription to True Orange and to subscribe to the faxes by E Mail a saving o Here s to subscribe to True Orange via the Internet and to subscribe yds UT B Johnson pass from Simms Mangum kick Q plays yds UT Edwards pass from Simms Mangum kick Q plays yds MU Hammerich FG Q plays yds UT Simms run Mangum kick Q plays yds UT R Williams pass from Simms Mangum kick Q plays yds MU Abro run pass failed Q plays yds UT Robin pass from Simms Q plays yds Official Attendance Team Statistics Texas Missouri First Downs Rushing Passing Penalty Rushing Attempts Net Yards Net Yards Passing Passes Comp Att Int Third Down Conversions of of Fourth Down Conversions of of Sacks by Team Yds Lost Individual Statistics Texas Rushing Benson Simms TD Ike B J Johnson Jeffery Trahan Team minus R Williams minus Passing Simms yds TD Int Sacks Lewis minus Key Statistics 
Missouri had possessions and five of them were three and out series while another gained only four yards in four plays and ended with an interception The Tigers also started both halves with terrible offensive statistics making zero first downs in the first quarter and one first down in the third quarter Scouting Baylor Texas will play at Baylor at Saturday in a game that will be televised by Fox Syndication The Longhorns ranked No in both polls are while the Bears are Texas is a point favorite Baylor has lost five straight after opening the season with a victory over Arkansas State and a win over New Mexico The Bears then lost to Iowa State Texas A M Nebraska Oklahoma and Texas Tech QB Greg Cicero a former Longhorn is the Bears main offensive threat As you can see from the teams respective national rankings below Baylor has had trouble running the ball There are Division teams playing football and the Bears are th in rushing and also th in total offense and th in scoring The Baylor defense has been tough at times particulary in the loss to A M But this is a game the Longhorns should win easily Look for RB Cedric Benson Texas Baylor Offense Rushing Avg Passing Avg Total Off Avg Scoring Avg Defense Rushing Avg Passing Avg Total Def Avg Opp Scoring Avg Longhorn Notes Starting guard Antwan Kirk Hughes sprained an ankle in practice last week and was held out of the Missouri game Coach Mack Brown said Derrick Dockery and Tillman Holloway played well at guard We have three really good guards he said QB Chris Simms climbed into fourth place on the Longhorn list for touchdown passes He has after throwing four against Missouri He passed Shea Morenz and Bobby Layne who had each Peter Gardere is in third place with Major Applewhite holds the school record with career touchdown passes WR Roy Williams had five receptions and a touchdown pass to move into th and th places respectively in the UT career charts
333635,mclaughlin-e/deleted_items/304.,"Message-ID: <1241335.1075841236157.JavaMail.evans@thyme>\nDate: Tue, 26 Jun 2001 15:00:09 -0700 (PDT)\nFrom: 40enron@enron.com\nSubject: Global Risk Management Operations Organization Announcement\nCc: richard.causey@enron.com, rick.buy@enron.com\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nBcc: richard.causey@enron.com, rick.buy@enron.com\nX-From: Sally Beck- (Managing Director Energy Operations)@ENRON <IMCEANOTES-Sally+20Beck-+20+28Managing+20Director+20Energy+20Operations+29+40ENRON@ENRON.com>\nX-To: Global Risk Management Operations List@ENRON\nX-cc: Causey, Richard </O=ENRON/OU=NA/CN=RECIPIENTS/CN=RCAUSEY>, Buy, Rick </O=ENRON/OU=NA/CN=RECIPIENTS/CN=RBUY>\nX-bcc: \nX-Folder: \ExMerge - McLaughlin Jr., Errol\Deleted Items\nX-Origin: MCLAUGHLIN-E\nX-FileName: erol mclaughlin 6-26-02.PST\n\n\nI am pleased to announce the following changes in the Global Risk Management Operations organization: \n\nBob Hall will lead operations for Enron Americas. Bob previously served as one of the business controllers for gas operations for Enron Americas, with direct responsibility for Gas Logistics and Volume Management. The leadership team reporting to Bob will include Peggy Hedstrom (Calgary Operations), Jeff Gossett ( U.S. Gas Risk and South America Operations), Stacey White (U.S. Power Risk), Leslie Reeves (U.S. Gas and Power Confirmations and Settlements), and Bob Superty (U.S. Gas Logistics). \n\nBrent Price will continue to lead Operations for Enron Global Markets and fulfill his dual role as Chief Accounting Officer for Enron Global Markets. \n\nI am very pleased to announce that Kevin Sweeney will lead Operations for Enron Industrial Markets. Kevin brings to his new role extensive risk and operations experience from stints in our Houston, London and Singapore offices. Kevin fills the role that was previously held by Brenda Herod, who will now report directly to Beth Apollo. 
\n\nBeth Apollo will be the Operations Project Manager for the assimilation of EES Wholesale. This will include Deal Capture, Risk Analysis, Confirmations, and Global Data functions which have been moved into EWS under Global Risk Management Operations. Successfully leveraging the infrastructure to support the EES business plan will also require close coordination with Gas Logistics, Settlements and the Operational Analysis (OA) process, and Beth will actively coordinate operations efforts with these teams and the commercial teams. I have asked Beth to project manage the effort to look for opportunities to streamline and improve systems, processes and controls, and to consider alignment, where it makes sense, with existing EWS processes. Reporting to Beth will be Scott Mills, with a focus on Deal Capture, Risk Analysis and Confirmations, and Brenda Herod, with a focus on invoicing control guidelines and reporting requirements. \n\nBeth will also continue to manage the Global Services function, which includes the consolidated DPR process and Global Operations Standards, led by Shona Wilson, and the Global Data function which is lead by James Scribner.\n\nMike Jordan will continue to lead operations for Enron Europe, working closely with me and other EWS operations leads to insure the proliferation of best operational practices worldwide for Enron. 
\n\nThanks for your continued support especially in exploring and seizing commercial opportunities to strengthen our contributions to Enron.",I am pleased to announce the following changes in the Global Risk Management Operations organization Bob Hall will lead operations for Enron Americas Bob previously served as one of the business controllers for gas operations for Enron Americas with direct responsibility for Gas Logistics and Volume Management The leadership team reporting to Bob will include Peggy Hedstrom Calgary Operations Jeff Gossett Gas Risk and South America Operations Stacey White Power Risk Leslie Reeves Gas and Power Confirmations and Settlements and Bob Superty Gas Logistics Brent Price will continue to lead Operations for Enron Global Markets and fulfill his dual role as Chief Accounting Officer for Enron Global Markets I am very pleased to announce that Kevin Sweeney will lead Operations for Enron Industrial Markets Kevin brings to his new role extensive risk and operations experience from stints in our Houston London and Singapore offices Kevin fills the role that was previously held by Brenda Herod who will now report directly to Beth Apollo Beth Apollo will be the Operations Project Manager for the assimilation of EES Wholesale This will include Deal Capture Risk Analysis Confirmations and Global Data functions which have been moved into EWS under Global Risk Management Operations Successfully leveraging the infrastructure to support the EES business plan will also require close coordination with Gas Logistics Settlements and the Operational Analysis OA process and Beth will actively coordinate operations efforts with these teams and the commercial teams I have asked Beth to project manage the effort to look for opportunities to streamline and improve systems processes and controls and to consider alignment where it makes sense with existing EWS processes Reporting to Beth will be Scott Mills with a focus on Deal Capture Risk Analysis and 
Confirmations and Brenda Herod with a focus on invoicing control guidelines and reporting requirements Beth will also continue to manage the Global Services function which includes the consolidated DPR process and Global Operations Standards led by Shona Wilson and the Global Data function which is lead by James Scribner Mike Jordan will continue to lead operations for Enron Europe working closely with me and other EWS operations leads to insure the proliferation of best operational practices worldwide for Enron Thanks for your continued support especially in exploring and seizing commercial opportunities to strengthen our contributions to Enron
237586,kean-s/attachments/207.,"Message-ID: <11705705.1075850999415.JavaMail.evans@thyme>\nDate: Fri, 4 Aug 2000 02:42:00 -0700 (PDT)\nFrom: sarah.novosel@enron.com\nTo: mitchell.taylor@enron.com, ann.ballard@enron.com, dwight.larson@enron.com, \n\tsteven.kean@enron.com, richard.shapiro@enron.com, \n\tpaul.kaufman@enron.com\nSubject: Memo on Mitigation Measures\nCc: awenner@velaw.com, sangle@velaw.com, sbehrend@llgm.com, jrutkows@llgm.com, \n\ttabors@tca-us.com\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nBcc: awenner@velaw.com, sangle@velaw.com, sbehrend@llgm.com, jrutkows@llgm.com, \n\ttabors@tca-us.com\nX-From: Sarah Novosel\nX-To: Mitchell Taylor, Ann Ballard, Dwight Larson, Steven J Kean, Richard Shapiro, Paul Kaufman\nX-cc: awenner@velaw.com, sangle@velaw.com, sbehrend@llgm.com, jrutkows@llgm.com, tabors@tca-us.com\nX-bcc: \nX-Folder: \Steven_Kean_Oct2001_2\Notes Folders\Attachments\nX-Origin: KEAN-S\nX-FileName: skean.nsf\n\nAs we discussed yesterday, attached is a memorandum summarizing our \nmitigation options. Also attached is a schematic show how PGE could be \ninterconnected to Sierra through LADWP. The schematic is in Word format so I \ncould not consolidate the two documents.\n\nPlease call me if you have any questions.\n\nSarah\n\n",As we discussed yesterday attached is a memorandum summarizing our mitigation options Also attached is a schematic show how PGE could be interconnected to Sierra through LADWP The schematic is in Word format so I could not consolidate the two documents Please call me if you have any questions Sarah
144718,grigsby-m/inbox/15.,"Message-ID: <11709461.1075853108946.JavaMail.evans@thyme>\nDate: Tue, 23 Oct 2001 07:04:24 -0700 (PDT)\nFrom: t..hodge@enron.com\nTo: mike.grigsby@enron.com\nSubject: RE: Texaco\nMime-Version: 1.0\nContent-Type: text/plain; charset=us-ascii\nContent-Transfer-Encoding: 7bit\nX-From: Hodge, Jeffrey T. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JHODGE>\nX-To: Grigsby, Mike </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mgrigsb>\nX-cc: \nX-bcc: \nX-Folder: \MGRIGSB (Non-Privileged)\Inbox\nX-Origin: Grigsby-M\nX-FileName: MGRIGSB (Non-Privileged).pst\n\nMike:\n\n\tIt seems to me that we can proceed one of 2 ways. Those alternatives are as follows:\n\n\t\t1. We could write a letter to Texaco and demand payment for the withheld amount.\n\n\t\t2. If we are paying them this month, we could withhold the $1.5 million from the\n\t\t amount we are paying them. Of course, we would need to provide them some \n\t\t sort of explanation.\n\n\tMy preference at this time would be to do the first item. If you could have someone provide me with the details of the transaction they withheld the money on I could draft the letter. If that letter brought no response, then we could consider taking the action outlined in the second item.\n\n\tI trust all is well with you. 
I look forward to hearing from you.\n\nThanks,\n\nJeff\n\n -----Original Message-----\nFrom: \tGrigsby, Mike \nSent:\tMonday, October 22, 2001 5:00 PM\nTo:\tHodge, Jeffrey T.\nSubject:\tTexaco\n\nJeff,\n\nHow should proceed with Texaco and their 1.5 million withholding?\n\n\nThanks,\nMike",Mike It seems to me that we can proceed one of ways Those alternatives are as follows We could write a letter to Texaco and demand payment for the withheld amount If we are paying them this month we could withhold the million from the amount we are paying them Of course we would need to provide them some sort of explanation My preference at this time would be to do the first item If you could have someone provide me with the details of the transaction they withheld the money on I could draft the letter If that letter brought no response then we could consider taking the action outlined in the second item I trust all is well with you I look forward to hearing from you Thanks Jeff Original Message Jeff How should proceed with Texaco and their million withholding Thanks Mike


For the next task you will need to tokenize the text. To do so, simply split it into words. The final result will obviously be better if your system also suggests appropriate punctuation. However, if you find that the results are better without it, you may remove all non-letter characters at the tokenization stage.
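As an illustration of the two options above, here is a minimal tokenizer sketch. The names `TOKEN_PATTERN`, `tokenize`, and `keep_punctuation` are hypothetical, and the choice of which characters count as punctuation is an assumption; it also assumes the input text still contains punctuation, which is not the case for the already cleaned messages shown above.

```python
import re

# Assumption: a token is either a run of letters, digits, or apostrophes,
# or a single punctuation mark from a small fixed set.
PUNCTUATION = ".,!?;:"
TOKEN_PATTERN = re.compile(r"[a-z0-9']+|[.,!?;:]")


def tokenize(text: str, keep_punctuation: bool = True) -> list[str]:
    """Lowercases the text and splits it into word and punctuation tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if keep_punctuation:
        return tokens
    # Drop the tokens that are punctuation marks
    return [token for token in tokens if token[0] not in PUNCTUATION]


print(tokenize("Thanks, Mike. How should we proceed?"))
# ['thanks', ',', 'mike', '.', 'how', 'should', 'we', 'proceed', '?']
```

Keeping punctuation marks as separate tokens lets the n-gram model suggest them like any other word; passing `keep_punctuation=False` gives the simpler letters-only variant.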

In [8]:
# emails.dropna().drop(columns=['message']).to_csv('cleaned_msgs.csv')

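import json

# Drop rows with missing values and the raw 'message' column, then remove empty cleaned messages
filtered = emails.dropna().drop(columns=['message'])
filtered = filtered[filtered['cleaned_message'] != '']

# Lowercase each message and split it on whitespace; the corpus is a list of token lists
corpus = filtered['cleaned_message'].str.lower().str.split().tolist()
with open('corpus.json', 'w') as file:
    json.dump(corpus, file)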

## Word completion

The system described here consists of two parts: completing a word and generating a continuation of the text (or several alternative continuations). Let us start with the first part.

In this part you will implement a method that completes a word from its beginning (its prefix). To do that, we first need to find all words that share a given prefix. The search function will be called after every letter the user types, so it is very important that the search runs as fast as possible. A simple scan over all words takes $O(|V| \cdot n)$ time, where $|V|$ is the vocabulary size and $n$ is the prefix length. Instead, we will write a [prefix tree](https://ru.wikipedia.org/wiki/Префиксное_дерево), which finds the words in $O(n + m)$, where $m$ is the number of matching words.
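For comparison, the simple scan mentioned above can be sketched as follows. The function name `brute_force_search` is hypothetical; such a baseline is mainly useful as a reference for checking that the prefix tree returns the same words.

```python
def brute_force_search(vocabulary: list[str], prefix: str) -> list[str]:
    """Returns all words starting with prefix by scanning the whole vocabulary."""
    # Each startswith call compares up to len(prefix) characters, so the scan
    # costs O(|V| * n) no matter how few words actually match.
    return [word for word in vocabulary if word.startswith(prefix)]


vocabulary = ['aa', 'aaa', 'abb', 'bba', 'bbb', 'bcd']
print(brute_force_search(vocabulary, 'bb'))  # ['bba', 'bbb']
```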

__Task 2 (1 point).__ Complete the prefix tree for searching words by prefix. Your tree must run in $O(n + m)$ operations, otherwise you will not receive any points for this task.

In [1]:
from typing import List

class PrefixTreeNode:
    def __init__(self):
        # dictionary mapping each character that can follow this node to its child node
        self.children: dict[str, PrefixTreeNode] = {}
        self.is_end_of_word = False


class PrefixTree:
    def __init__(self, vocabulary: List[str]):
        """
        vocabulary: list of all unique tokens in the corpus
        """
        self.root = PrefixTreeNode()
        for word in vocabulary:
            self.insert(word)

    def insert(self, word: str):
        """
        Inserts a word into the prefix tree.
        """
        current_node = self.root
        for char in word:
            # If character not in current node's children, add it
            if char not in current_node.children:
                current_node.children[char] = PrefixTreeNode()
            # Move to the next node
            current_node = current_node.children[char]
        # Mark the end of a word
        current_node.is_end_of_word = True

    def search_prefix(self, prefix: str) -> List[str]:
        """
        Returns all words that start with prefix.
        prefix: str, the beginning of a word
        """

        result = []
        current_node = self.root

        for char in prefix:
            if char in current_node.children:
                current_node = current_node.children[char]
            else:
                return result

        def collect_all_words(node, path):
            if node.is_end_of_word:
                result.append(path)
            for char, child_node in node.children.items():
                collect_all_words(child_node, path + char)

        # Collect all words starting from the end of the prefix
        collect_all_words(current_node, prefix)

        return result

In [36]:
vocabulary = ['aa', 'aaa', 'abb', 'bba', 'bbb', 'bcd']
prefix_tree = PrefixTree(vocabulary)

assert set(prefix_tree.search_prefix('a')) == set(['aa', 'aaa', 'abb'])
assert set(prefix_tree.search_prefix('bb')) == set(['bba', 'bbb'])
print('success')


success


Now that we have a fast way to find all words with a given prefix, we need to order them by probability so that we can pick the best one. We will estimate the probability of a word by how often it occurs in the corpus.

__Task 3 (1 point).__ Complete the `WordCompletor` class, which builds the vocabulary and the prefix tree and can find all possible completions of a word together with their probabilities. In this class you may additionally filter the words if needed, for example by removing the rarest ones. Try to optimize your code as much as possible.

In [3]:
from typing import Tuple
from collections import Counter


class WordCompletor:
    def __init__(self, corpus: List[List[str]]):
        """
        corpus: list of tokenized texts (each text is a list of tokens)
        """
        counter = Counter()
        for words in corpus:
            counter.update(words)

        total = sum(counter.values())
        # Unigram probability of a word: its count divided by the total number of tokens
        self.probs = {word: count / total for word, count in counter.items()}
        self.vocabulary = list(counter)

        self.prefix_tree = PrefixTree(self.vocabulary)

    def get_words_and_probs(self, prefix: str) -> Tuple[List[str], List[float]]:
        # Find all words starting with the given prefix
        words = self.prefix_tree.search_prefix(prefix)
        probs = [self.probs[word] for word in words]
        return words, probs

In [35]:
dummy_corpus = [
    ["aa", "ab"],
    ["aaa", "abab"],
    ["abb", "aa", "ab", "bba", "bbb", "bcd"],
]

word_completor = WordCompletor(dummy_corpus)
words, probs = word_completor.get_words_and_probs('a')
words_probs = list(zip(words, probs))
assert set(words_probs) == {('aa', 0.2), ('ab', 0.2), ('aaa', 0.1), ('abab', 0.1), ('abb', 0.1)}
print('success')


success
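The optional filtering of rare words mentioned in Task 3 could look roughly like the sketch below. The names `build_word_probs`, `min_count`, and `top_k` are hypothetical, and renormalizing the probabilities over the kept words only is an assumption rather than a requirement of the task.

```python
import heapq
from collections import Counter


def build_word_probs(corpus: list[list[str]], min_count: int = 2) -> dict[str, float]:
    """Counts words, drops those seen fewer than min_count times, and normalizes."""
    counter = Counter(word for words in corpus for word in words)
    kept = {word: count for word, count in counter.items() if count >= min_count}
    total = sum(kept.values())
    # Assumption: probabilities are renormalized over the kept words only
    return {word: count / total for word, count in kept.items()}


def top_k(words: list[str], probs: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Returns the k most probable (word, probability) pairs."""
    return heapq.nlargest(k, zip(words, probs), key=lambda pair: pair[1])


dummy_corpus = [
    ["aa", "ab"],
    ["aaa", "abab"],
    ["abb", "aa", "ab", "bba", "bbb", "bcd"],
]
print(build_word_probs(dummy_corpus, min_count=2))  # {'aa': 0.5, 'ab': 0.5}
```

A smaller vocabulary also means a smaller prefix tree, so the filter reduces both memory use and the number of candidates to rank on every keystroke.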


## Predicting the next words

Now that we can finish a word for the user, we can go further and suggest several following words that take the completed word into account. To do this, we will use n-grams and suggest the next n words. But first we need to build an n-gram model.

Recall that the probability of a sequence under such a model is given by
$$
P(w_1, \dots, w_T) = \prod_{i=1}^T P(w_i \mid w_{i-1}, \dots, w_{i-n}).
$$

Therefore, we need to estimate $P(w_i \mid w_{i-1}, \dots, w_{i-n})$ from the frequencies of n-grams in the corpus.
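One way to write this frequency estimate down explicitly, assuming $c(\cdot)$ denotes the number of times a token sequence occurs in the corpus (a symbol introduced here only for illustration), is
$$
\hat{P}(w_i \mid w_{i-1}, \dots, w_{i-n}) = \frac{c(w_{i-n}, \dots, w_{i-1}, w_i)}{\sum_{w'} c(w_{i-n}, \dots, w_{i-1}, w')}.
$$
With this estimate, any word that never followed the given context in the corpus gets probability zero, so the model can only suggest continuations it has actually seen.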

__Task 4 (1 point).__ Write a class for the n-gram model. Naturally, no smoothing should be added: we do not want the model to suggest random words (even if only very rarely).

In [27]:
from collections import defaultdict


class NGramLanguageModel:
    def __init__(self, corpus, n):
        self.ngram_counts = defaultdict(Counter)
        self.context_counts = Counter()
        self.n = n

        for words in corpus:
            for i in range(len(words)):
                for j in range(i+1, min(len(words), i+n+1)):
                    context = tuple(words[i:j])
                    word = words[j]
                    self.ngram_counts[context][word] += 1
                    self.context_counts[context] += 1



    def get_next_words_and_probs(self, prefix: list) -> Tuple[List[str], List[float]]:
        """
        Returns the list of words that can follow prefix,
        together with the list of probabilities of these words
        """
        context = tuple(prefix[-self.n:])
        next_words_counter = self.ngram_counts.get(context, Counter())

        if not next_words_counter:
            return [], []

        total_count = self.context_counts[context]
        next_words = list(next_words_counter.keys())
        probs = [count / total_count for count in next_words_counter.values()]

        return next_words, probs


In [34]:
dummy_corpus = [
    ['aa', 'aa', 'aa', 'aa', 'ab'],
    ['aaa', 'abab'],
    ['abb', 'aa', 'ab', 'bba', 'bbb', 'bcd']
]

n_gram_model = NGramLanguageModel(corpus=dummy_corpus, n = 2)

next_words, probs = n_gram_model.get_next_words_and_probs(['aa', 'aa'])
words_probs = list(zip(next_words, probs))

assert set(words_probs) == {('aa', 2/3), ('ab', 1/3)}
print('success')


success


Great, we can now combine the two methods into an automatic text completer: the first will complete the word, and the second will suggest continuations. We want it to offer a list of possible continuations from which the user can pick the most suitable one. The hardest part here is choosing carefully what to show and what to leave out.

__Task 5 (1 point).__ As a first attempt, implement a method that always returns the single most probable continuation, chosen greedily. If you manage that, you can add support for several alternative continuations, which will make the method much better.

In [31]:
from typing import Union
import heapq


def get_top_n_words(words: List[str], probs: List[float], n: int) -> List[str]:
    top_n = heapq.nlargest(n, zip(probs, words))
    top_n_words = [word for _, word in top_n]
    return top_n_words


def get_top_word(words: List[str], probs: List[float]) -> str:
    max_p, max_w = 0, None
    for w, p in zip(words, probs):
        if p > max_p:
            max_p, max_w = p, w
    return max_w


class TextSuggestion:
    def __init__(self, word_completor, n_gram_model):
        self.word_completor = word_completor
        self.n_gram_model = n_gram_model

    def suggest_text(self, text: Union[str, list], n_words=3, need_correction=True, n_texts=1) -> list[list[str]]:
        """
        Returns possible continuations of the text (only one by default)

        text: string or list of words, the text typed by the user
        n_words: number of words appended by the n-gram model
        need_correction: whether the last word should be completed
        n_texts: number of continuations to return

        return: list[list[str]], a list of up to n_texts lists of words, each with at most 1 + n_words words
        (a continuation is shorter if the n-gram model has no known next word).
        The first word is the one that WordCompletor completed.
        """
        if isinstance(text, str):
            text = text.split()

        if not text:
            return []

        suggestions = []
        last_word = text[-1]
        extended_text = text
        if need_correction:
            completion = get_top_word(*self.word_completor.get_words_and_probs(last_word))
            if completion:
                extended_text = text[:-1] + [completion]
                last_word = completion

        # The first predicted word branches into the n_texts most probable options
        next_words = self.n_gram_model.get_next_words_and_probs(extended_text)
        next_words = get_top_n_words(*next_words, n_texts)
        if not next_words:
            return []

        for n_w in next_words:
            # Each candidate gets its own context, which grows as words are appended
            context = extended_text + [n_w]
            current_suggestion = [last_word, n_w]
            for _ in range(n_words - 1):
                next_word = get_top_word(*self.n_gram_model.get_next_words_and_probs(context))
                if next_word is None:
                    # No known continuation for this context: stop instead of appending None
                    break
                current_suggestion.append(next_word)
                context.append(next_word)
            suggestions.append(current_suggestion)

        return suggestions

In [33]:
dummy_corpus = [
    ['aa', 'aa', 'aa', 'aa', 'ab'],
    ['aaa', 'abab'],
    ['abb', 'aa', 'ab', 'bba', 'bbb', 'bcd']
]

word_completor = WordCompletor(dummy_corpus)
n_gram_model = NGramLanguageModel(corpus=dummy_corpus, n=2)
text_suggestion = TextSuggestion(word_completor, n_gram_model)



assert text_suggestion.suggest_text(['aa', 'aa'], n_words=3, n_texts=1) == [['aa', 'aa', 'aa', 'aa']]
assert text_suggestion.suggest_text(['abb', 'aa', 'ab'], n_words=2, n_texts=1) == [['ab', 'bba', 'bbb']]
print('success')

success
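The optional extension to several continuations mentioned in Task 5 is often done with beam search instead of a purely greedy choice. Below is a minimal sketch under stated assumptions: the names `beam_search`, `next_fn`, `beam_width`, and the toy bigram table are hypothetical, and `next_fn` is only assumed to have the same interface as `get_next_words_and_probs` from Task 4.

```python
import heapq
from typing import Callable

NextFn = Callable[[list[str]], tuple[list[str], list[float]]]


def beam_search(context: list[str], next_fn: NextFn, n_words: int, beam_width: int) -> list[list[str]]:
    """Keeps the beam_width most probable continuations after every added word."""
    # Each beam entry is (probability of the continuation, continuation words)
    beams = [(1.0, [])]
    for _ in range(n_words):
        candidates = []
        for prob, continuation in beams:
            words, probs = next_fn(context + continuation)
            if not words:
                # Dead end: keep the continuation unchanged
                candidates.append((prob, continuation))
                continue
            for word, p in zip(words, probs):
                candidates.append((prob * p, continuation + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return [continuation for _, continuation in beams]


# Toy bigram table used only for this illustration
TOY_BIGRAMS = {
    'i': (['am', 'will'], [0.6, 0.4]),
    'am': (['fine', 'here'], [0.55, 0.45]),
    'will': (['call'], [1.0]),
}


def toy_next(context: list[str]) -> tuple[list[str], list[float]]:
    return TOY_BIGRAMS.get(context[-1], ([], [])) if context else ([], [])


print(beam_search(['i'], toy_next, n_words=2, beam_width=2))
# [['will', 'call'], ['am', 'fine']]
```

In this toy example a greedy choice would start with `am`, while the beam also keeps `will` and finds that `will call` is the more probable two-word continuation overall. Note that dead-end continuations are not penalized here, so they can outrank longer ones; a real system would likely need an additional heuristic for that.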


In [23]:
text_suggestion.suggest_text(['aa', 'aa'], n_words=3, n_texts=1)

[['aa', 'aa', 'aa', 'aa']]

In [39]:
text_suggestion = TextSuggestion(word_completor, n_gram_model)

## Part 2

It is time to polish your system. In this part you may modify all classes as you see fit and add any heuristics. If needed, you may preprocess the text further and generally do whatever you consider necessary, __except for using additional data__. Most importantly, you must wrap your system in a user interface using [reflex](https://github.com/reflex-dev/reflex). You can implement almost any functionality you like in it.

We strongly recommend organizing your code as a project rather than writing it in the notebook. But if you really want to write it here, then at least do not change the code in the previous tasks, so that it can be graded properly.

When submitting your solution, attach all of your __code__, a __report__ on Part 2, and a __video__ demonstrating how your system works. Good luck!