# Regular Expressions Assignment

## Part 1: Dealing with Noisy Data



We use Pandas to extract information about the largest lakes in the world from the Wikipedia page `https://en.wikipedia.org/wiki/List_of_lakes_by_area`. We want to extract the list of lakes and their areas from this page.

The code below extracts the information from Wikipedia and generates a CSV file, `largest_lakes.csv`, with the information. (You can also find the `largest_lakes.csv` file attached.)

```python
import pandas as pd
# Extract the tables in the HTML page that contain the term "Water volume"
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_lakes_by_area', match = 'Water volume', header=0)
# Get the first table from the list of tables extracted from the HTML page, which is the one that we want
lakes = tables[0]
# Replace the character \xa0 with space
lakes.replace(to_replace = r'\xa0', value= r' ', regex=True, inplace=True)
# Save the Name and Area columns as a CSV file
lakes[['Name', 'Area']].to_csv("largest_lakes.csv", index=False)
```

Now open the file and read its contents into memory as `lines`, a list of strings (one entry per line):

In [1]:
with open("largest_lakes.csv", "r") as f:
    lines = f.read().splitlines()

If you take a look at the extracted information, though, you see that it is a bit messy: the names of the lakes have leftovers from footnotes, and the area column contains extra characters that we do not need.

In [None]:
lines[1:10]

Our goal is to use regular expressions to process the file and create a clean file with the name of the lake in the first column and the area of the lake (in square miles) in the next column. The area column should be an integer (without commas). For example, the first 9 data lines (listed above) should be transformed into:

```
Caspian Sea	143000
Superior	31700
Victoria	26590
Huron	23000
Michigan	22000
Tanganyika	12600
Baikal	12200
Great Bear Lake	12000
Malawi	11400
```

In [2]:
# Your code here (It should not be more than 10-15 lines, at most)
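import re
# Lake name: starts with a capital, then ASCII letters or spaces, ends with a lowercase letter
lake = re.compile(r'[A-Z][A-Za-z ]*[a-z]')
# Area: the digits immediately before " sq" (commas are stripped from the line first)
area = re.compile(r'\d+?(?= sq)')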

for line in lines[1:]:
    lake_matches = lake.finditer(line)
    area_matches = area.finditer(line.replace(",",""))
    for match in lake_matches:
        print(match.group(),end=" ")
    for match in area_matches:
        print(match.group())

Caspian Sea 143000
Superior 31700
Victoria 26590
Huron 23000
Michigan 22000
Tanganyika 12600
Baikal 12200
Great Bear Lake 12000
Malawi 11400
Great Slave Lake 10000
Erie 9900
Winnipeg 9465
Ontario 7320
Ladoga 7000
Balkhash 6300
Vostok 4800
Onega 3700
Titicaca 3232
Nicaragua 3191
Athabasca 3030
Taymyr 2700
Turkana 2473
Reindeer Lake 2440
Issyk Kul 2400
Urmia 2317
2141
Winnipegosis 2086
Albert 2046
Mweru 1980
Nettilling 1956
Nipigon 1870
Manitoba 1817
Great Salt Lake 1800
Qinghai Lake 1733
Saimaa 1700
Lake of the Woods 1680
Khanka 1620
Sarygamysh 1527
Dubawnt 1480
Van 1450
Peipus 1373
Uvs 1290
Poyang 1240
Tana 1200
Amadjuak 1203
Melville 1185
Bangweulu 1200
Dongting 1090
Kh 1070
Tonl Sap 1000
Kivu 1000
Wollaston 1035
Alakol 1020
Iliamna 1012
Hulun 903
Mistassini 902
Edward 898
Mai Ndombe 890
Nueltin 880
Tai 870
Southern Indian Lake 868
Chany 770
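
The `lake` pattern above accepts only ASCII letters and spaces, which is why some rows in the output lose characters (`Kh 1070`, `Tonl Sap 1000`) or the whole name (the row showing only `2141`). A minimal sketch of a more tolerant variant follows. The raw line layout in `sample_lines` is an assumption made for illustration, not copied from `largest_lakes.csv`, and `clean` is a hypothetical helper name.

```python
import re

# Hypothetical raw lines; the real contents of largest_lakes.csv may differ.
sample_lines = [
    'Caspian Sea*,"371,000 km2 (143,000 sq mi)"',
    'Khövsgöl[n 3],"2,770 km2 (1,070 sq mi)"',
    'Tonlé Sap,"2,700 km2 (1,000 sq mi)"',
]

# [^\W\d_] matches word characters that are neither digits nor "_",
# i.e. roughly any Unicode letter. Words may be joined by a space,
# a hyphen, or an apostrophe; anything after that (footnote residue) is dropped.
name_re = re.compile(r"^([^\W\d_]+(?:[ \-'][^\W\d_]+)*)")
# The number (with thousands separators) right before " sq mi".
area_re = re.compile(r"([\d,]+) sq mi")

def clean(line):
    """Return 'name<TAB>area' for one raw line, or None if it does not parse."""
    name = name_re.search(line)
    area = area_re.search(line)
    if not (name and area):
        return None
    return f"{name.group(1)}\t{int(area.group(1).replace(',', ''))}"

for line in sample_lines:
    print(clean(line))
```

Returning `None` for lines that do not parse keeps unexpected rows visible to the caller instead of printing an area with no name.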


## Part 2: Reformatting a Data File

You are given a file `roster-f2018.txt` that contains the roster of students enrolled in the class.

The file contains three **tab-separated** columns: `Section`, `Name`, `Email`. 

* The `Section` can be either `S1` or `S2`.
* The `Name` column has the format [_lastname, firstname middlename_]. Not all students have a middle name listed.
* The email is the NYU email of the student.

In [3]:
with open("roster-f2018.txt") as f:
    lines = f.read().splitlines()

In [None]:
# Last 10 lines of the file
lines[-10:]

You are asked to reformat the file. The reformatted file should be tab-separated, and should include five columns:

`{section}\t{netid}\t{first}\t{middle}\t{last}`

The requirements:
* The first column should be `Section`, but instead of the values `S1` and `S2`, it should say `Section 1` and `Section 2`, respectively.
* The second column should be the `NetId`. For someone with the email `pi1@nyu.edu`, the NetId is `pi1`.
* The third column should be the first name of the student.
* The fourth column should be the middle name of the student (should be empty when there is no middle name)
* The fifth column should be the last name of the student

You can see an [example of the reformatted file](https://docs.google.com/spreadsheets/d/10j33VgMU6Kjf1MIUnNWpEKXLjLCNXwxjVkMwDU74n08/edit?usp=sharing).
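
To make the target layout concrete, here is a minimal sketch of the transformation applied to two made-up roster lines. The names and NetIds in `sample_lines` are hypothetical, `reformat` is an illustrative helper name, and the pattern assumes the name field looks like `last,first middle` with an optional space after the comma.

```python
import re

# Hypothetical roster lines: section, "last,first middle", email (tab-separated).
sample_lines = [
    "S1\tDoe,Jane Ann\tjd123@nyu.edu",
    "S2\tRoe,Richard\trr456@nyu.edu",
]

row_re = re.compile(
    r"^S(?P<section>[12])\t"
    r"(?P<last>[^,\t]+), ?(?P<first>\S+)(?: (?P<middle>[^\t]+))?\t"
    r"(?P<netid>[^@\t]+)@nyu\.edu$"
)

def reformat(line):
    """Return the five-column row for one roster line, or None if it does not match."""
    m = row_re.match(line)
    if m is None:
        return None
    middle = m.group("middle") or ""  # empty column when no middle name is listed
    return "\t".join([
        f"Section {m.group('section')}",
        m.group("netid"),
        m.group("first"),
        middle,
        m.group("last"),
    ])

for line in sample_lines:
    print(reformat(line))
```

Named groups keep the output order independent of the order in which the fields appear in the input, which makes it harder to swap columns by accident.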

In [4]:
# Your code here (It should not be more than 10-15 lines, at most)
import re
# Groups: 1 = section, 2 = last name, 3 = first name, 4 = optional middle name, 5 = NetId
regex = re.compile(r'(S[12])\t([A-Z][a-z]+),([A-Z][a-z]+) ?([A-Z][a-z]*)?\t(\w+?(?=@nyu\.edu))')
row_template = "{}\t{}\t{}\t{}\t{}"
for line in lines[1:]:
    matches = regex.finditer(line)
    for match in matches:
        middle = match.group(4)
        if middle is None:
            middle = ""
        # Required column order: section, NetId, first, middle, last
        row = row_template.format(match.group(1).replace("S","Section "),match.group(5),match.group(3),middle,match.group(2))
        print(row)

Section 1	mb6351	Malik		Bouaoudia
Section 1	kcc407	Kent		Cai
Section 1	jmc1250	Jason		Chabora
Section 1	jc7015	Jonah		Chang
Section 1	ac6325	Amy		Chen
Section 1	vc1238	Vlad		Cherevkov
Section 1	xf365	Judy		Fu
Section 1	ggl245	Gabriel		Garcia
Section 1	wh916	Wangrui		Hou
Section 1	jli232	Jess		Ingraham
Section 1	ak5562	Aditya		Khosla
Section 1	ak5635	Arkin		Khosla
Section 1	pk1676	Pratyush		Kundu
Section 1	al4533	Akshat		Lakhotia
Section 1	al5361	Allen		Lin
Section 1	jl7028	Jonathan		Lin
Section 1	yl4042	Yuheng		Ling
Section 1	col223	Cristina	Orayani	Llacer
Section 1	rl2838	Rahul		Loney
Section 1	rm4467	Riku		Maeda
Section 1	cn1095	Christie		Nakajima
Section 1	nn1079	Nicole		Ng
Section 1	cro257	Chris		Osufsen
Section 1	krp354	Kenton		Palmer
Section 1	tr1328	Teona		Ristova
Section 1	rr2875	Rachel		Rub
Section 1	orr214	Oliver		Ryan
Section 1	ls4049	Liza		Sakhaie
Section 1	jps661	Jainam	Paras	Shah
Section 1	js9950	Faye		Shao
Section 1	js9212	Jiaxin		Shi
Section 1	ttt296	Nhi		Tran
Section 1

## Part 3: Detecting Problematic Data Entries

We are going to process the official NYPD Complaints dataset (available from NYC Open Data). The dataset contains all the complaints to the NYPD that were reported from 2006 until today (the `RPT_DT` column contains the date when the incident was reported).

The code below fetches the latest version of the dataset from NYC Open Data, and creates a smaller file with just 4 columns: CMPLNT_NUM (the complaint number), RPT_DT (the date the incident was reported), CMPLNT_FR_DT (the date the incident occurred), and CMPLNT_FR_TM (the time the incident occurred).

```python
import pandas as pd
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv
df = pd.read_csv('nypd.csv')
df[['CMPLNT_NUM','RPT_DT','CMPLNT_FR_DT','CMPLNT_FR_TM']].dropna().to_csv("nypd_short.csv.gz", index=False, compression='gzip')
del(df)
```

The code below loads the shortened file into memory (in the `lines` variable). You can then take a look at its last lines.

In [5]:
import gzip
with gzip.open('nypd_short.csv.gz', 'rt') as f:
    lines = f.read().splitlines()

In [None]:
lines[-50:]

Our focus for this assignment will be the columns `CMPLNT_FR_DT` and `CMPLNT_FR_TM`, which record the date and time at which the crime **occurred**. (Note that the date on which the incident was reported and the date on which it occurred are not necessarily the same; sometimes it takes years for an incident to be reported.) The date is recorded in the MM/DD/YYYY format, and the time is recorded as a 24-hr time (00:00 to 23:59).

Unfortunately, the dataset seems to include some dates in the `CMPLNT_FR_DT` column that are incorrect, and some times in the `CMPLNT_FR_TM` that are incorrect. Your task is to write code that uses regular expressions to detect these entries, and print them out. 

* You should check the `CMPLNT_FR_DT` column for correctness. In general, any date whose year is not 19xx or 20xx should be marked as **definitely incorrect**. Dates before 1930 (i.e., almost 90 years have passed!) should be treated as **likely incorrect**.
* You should also check the `CMPLNT_FR_TM` column and detect any times that do not follow the 24-hr time format (00:00 to 23:59).
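
The two rules above can be sketched as a pair of small checks on individual field values. This is only an illustration under stated assumptions: `classify_date` and `is_valid_time` are hypothetical helper names, the whole field is required to match (`fullmatch`), and a trailing `:SS` seconds part is accepted as optional because values of the kind found in the dataset (for example `24:00:00`) include seconds.

```python
import re

# MM/DD/YYYY; only the shape is checked here, not calendar validity.
date_re = re.compile(r"\d{2}/\d{2}/(\d{4})")
# Years in 19xx or 20xx are acceptable centuries.
century_re = re.compile(r"(19|20)\d{2}")
# 1900-1929: acceptable century, but implausibly old.
old_year_re = re.compile(r"19[0-2]\d")
# HH:MM with an optional :SS part (the seconds part is an assumption).
time_re = re.compile(r"([01]\d|2[0-3]):[0-5]\d(:[0-5]\d)?")

def classify_date(date):
    """Return 'definitely incorrect', 'likely incorrect', or 'ok'."""
    m = date_re.fullmatch(date)
    if m is None or not century_re.fullmatch(m.group(1)):
        return "definitely incorrect"
    if old_year_re.fullmatch(m.group(1)):
        return "likely incorrect"
    return "ok"

def is_valid_time(time):
    """True when the value is a 24-hr time between 00:00 and 23:59."""
    return time_re.fullmatch(time) is not None

print(classify_date("12/04/1015"))  # year 1015 is neither 19xx nor 20xx
print(classify_date("09/20/1915"))  # 19xx, but before 1930
print(is_valid_time("24:00:00"))    # hour 24 is outside 00-23
```

Using `fullmatch` instead of `search` means stray characters around an otherwise valid value are also reported, which may or may not be desirable for this dataset.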

In [6]:
# Detect incorrect dates and print out the incorrect lines
import re

# Compile once; the patterns are applied to every line
valid_century = re.compile(r'(19\d{2})|(20\d{2})')
plausible_year = re.compile(r'(19[3-9][0-9])|(20\d{2})')

for line in lines[1:]:
    occurred = line.split(",")[2]  # CMPLNT_FR_DT
    if not valid_century.search(occurred):
        print("Definitely incorrect: "+line)
        continue
    if not plausible_year.search(occurred):
        print("Likely incorrect: "+line)

Definitely incorrect: 211843983,12/10/2015,12/04/1015,16:45:00
Definitely incorrect: 131106711,12/10/2015,12/04/1015,12:30:00
Definitely incorrect: 821425869,12/01/2015,11/25/1015,14:30:00
Definitely incorrect: 161924074,11/23/2015,09/26/1015,12:11:00
Definitely incorrect: 414788103,11/05/2015,10/27/1015,19:30:00
Definitely incorrect: 148685327,10/23/2015,10/17/1015,16:00:00
Likely incorrect: 962210034,10/06/2015,09/20/1915,18:00:00
Definitely incorrect: 704380800,10/01/2015,09/16/1015,12:00:00
Likely incorrect: 645928669,09/23/2015,09/23/1915,14:15:00
Likely incorrect: 138433797,08/21/2015,08/20/1915,07:45:00
Likely incorrect: 880111721,08/05/2015,08/04/1915,22:00:00
Likely incorrect: 126257773,08/04/2015,08/04/1915,11:57:00
Likely incorrect: 358619818,05/15/2015,09/20/1910,00:01:00
Likely incorrect: 538002115,03/08/2015,08/07/1915,15:00:00
Likely incorrect: 803922313,02/24/2015,01/29/1915,06:00:00
Likely incorrect: 335984386,01/19/2015,12/06/1914,16:10:00
Likely incorrect: 191430914,

Definitely incorrect: 851853808,03/17/2017,02/26/1017,17:00:00
Definitely incorrect: 415869603,03/03/2017,02/19/1017,18:15:00
Likely incorrect: 416577648,02/08/2017,09/20/1916,12:00:00
Definitely incorrect: 602703627,01/28/2017,01/16/1017,15:10:00
Definitely incorrect: 847834014,01/10/2017,12/08/1017,23:30:00


In [7]:
# Detect incorrect times and print out the incorrect lines
import re
# Skip the header row (lines[0]), which is not a data entry
for line in lines[1:]:
    match = re.search(r'(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]',line.split(",")[3])
    if not match:
        print("Incorrect time: "+line)

Incorrect time: 377121542,09/01/2009,08/31/2009,24:00:00
Incorrect time: 238162576,08/31/2009,08/31/2009,24:00:00
Incorrect time: 880282041,08/27/2009,08/18/2009,24:00:00
Incorrect time: 821869301,08/27/2009,08/18/2009,24:00:00
Incorrect time: 262838873,08/26/2009,01/09/2009,24:00:00
Incorrect time: 347926023,08/25/2009,08/24/2009,24:00:00
Incorrect time: 651492248,08/17/2009,08/16/2009,24:00:00
Incorrect time: 685971531,08/17/2009,07/21/2009,24:00:00
Incorrect time: 656754496,08/16/2009,08/15/2009,24:00:00
Incorrect time: 884388382,08/13/2009,08/12/2009,24:00:00
Incorrect time: 678406708,08/12/2009,04/01/2004,24:00:00
Incorrect time: 986023642,08/11/2009,08/04/2009,24:00:00
Incorrect time: 861838411,08/09/2009,08/08/2009,24:00:00
Incorrect time: 317281632,08/09/2009,08/07/2009,24:00:00
Incorrect time: 354767560,08/08/2009,08/07/2009,24:00:00
Incorrect time: 245759440,08/08/2009,06/01/2007,24:00:00
Incorrect time: 800225107,08

Incorrect time: 545823758,12/26/2008,12/25/2008,24:00:00
Incorrect time: 393405561,12/26/2008,12/25/2008,24:00:00
Incorrect time: 709439701,12/24/2008,12/23/2008,24:00:00
Incorrect time: 816244706,12/24/2008,12/23/2008,24:00:00
Incorrect time: 744413687,12/24/2008,11/28/2008,24:00:00
Incorrect time: 719358951,12/24/2008,08/01/2004,24:00:00
Incorrect time: 334871553,12/23/2008,12/22/2008,24:00:00
Incorrect time: 742817994,12/23/2008,12/18/2008,24:00:00
Incorrect time: 315179896,12/22/2008,12/21/2008,24:00:00
Incorrect time: 153665625,12/19/2008,12/18/2008,24:00:00
Incorrect time: 310014089,12/19/2008,11/25/2008,24:00:00
Incorrect time: 438652242,12/18/2008,12/12/2008,24:00:00
Incorrect time: 345361743,12/17/2008,12/13/2008,24:00:00
Incorrect time: 714716691,12/17/2008,12/01/2008,24:00:00
Incorrect time: 858168344,12/17/2008,09/02/2008,24:00:00
Incorrect time: 797356213,12/15/2008,12/11/2008,24:00:00
Incorrect time: 754668657,12/13/2008,12/12/2008,24:00:00
Incorrect time: 476300674,12/12

Incorrect time: 425029400,09/01/2008,08/31/2008,24:00:00
Incorrect time: 965457316,09/01/2008,08/31/2008,24:00:00
Incorrect time: 798537111,09/01/2008,08/31/2008,24:00:00
Incorrect time: 164566062,09/01/2008,08/28/2008,24:00:00
Incorrect time: 136037731,08/30/2008,08/29/2008,24:00:00
Incorrect time: 943139649,08/30/2008,08/15/2008,24:00:00
Incorrect time: 897095171,08/28/2008,08/19/2008,24:00:00
Incorrect time: 658659271,08/27/2008,08/25/2008,24:00:00
Incorrect time: 871199583,08/26/2008,08/25/2008,24:00:00
Incorrect time: 177663495,08/26/2008,08/24/2008,24:00:00
Incorrect time: 938481704,08/25/2008,08/24/2008,24:00:00
Incorrect time: 869447973,08/25/2008,08/22/2008,24:00:00
Incorrect time: 258156100,08/24/2008,08/23/2008,24:00:00
Incorrect time: 965351497,08/24/2008,06/24/2008,24:00:00
Incorrect time: 616893734,08/24/2008,02/01/2007,24:00:00
Incorrect time: 568047046,08/22/2008,08/15/2008,24:00:00
Incorrect time: 239441444,08/22/2008,06/18/2008,24:00:00
Incorrect time: 109122294,08/21

Incorrect time: 365881645,05/16/2008,02/21/2008,24:00:00
Incorrect time: 908569528,05/15/2008,05/14/2008,24:00:00
Incorrect time: 695559330,05/14/2008,05/13/2008,24:00:00
Incorrect time: 794223402,05/12/2008,05/11/2008,24:00:00
Incorrect time: 780586542,05/12/2008,05/07/2008,24:00:00
Incorrect time: 583311467,05/11/2008,05/09/2008,24:00:00
Incorrect time: 114669331,05/09/2008,05/08/2008,24:00:00
Incorrect time: 937145860,05/09/2008,10/12/2007,24:00:00
Incorrect time: 610523896,05/09/2008,05/09/1988,24:00:00
Incorrect time: 620345690,05/08/2008,04/15/2008,24:00:00
Incorrect time: 481936611,05/08/2008,04/09/2008,24:00:00
Incorrect time: 893223094,05/07/2008,05/06/2008,24:00:00
Incorrect time: 464888973,05/07/2008,05/05/2008,24:00:00
Incorrect time: 612765664,05/07/2008,03/12/2008,24:00:00
Incorrect time: 635227790,05/07/2008,08/01/2006,24:00:00
Incorrect time: 989208996,05/06/2008,05/05/2008,24:00:00
Incorrect time: 359612242,05/05/2008,05/04/2008,24:00:00
Incorrect time: 619596798,05/05

Incorrect time: 838746807,09/26/2007,09/25/2007,24:00:00
Incorrect time: 203465327,09/25/2007,09/24/2007,24:00:00
Incorrect time: 367543125,09/25/2007,09/24/2007,24:00:00
Incorrect time: 718750965,09/25/2007,09/23/2007,24:00:00
Incorrect time: 389586598,09/25/2007,09/20/2007,24:00:00
Incorrect time: 116247690,09/24/2007,09/21/2007,24:00:00
Incorrect time: 865309426,09/24/2007,09/10/2007,24:00:00
Incorrect time: 626739576,09/23/2007,09/22/2007,24:00:00
Incorrect time: 767832308,09/23/2007,09/22/2007,24:00:00
Incorrect time: 729635496,09/22/2007,09/21/2007,24:00:00
Incorrect time: 723310652,09/22/2007,09/21/2007,24:00:00
Incorrect time: 297150699,09/22/2007,09/07/2007,24:00:00
Incorrect time: 503302472,09/21/2007,09/20/2007,24:00:00
Incorrect time: 937933972,09/21/2007,09/20/2007,24:00:00
Incorrect time: 896273910,09/21/2007,05/11/2007,24:00:00
Incorrect time: 333586012,09/18/2007,09/16/2007,24:00:00
Incorrect time: 204213836,09/18/2007,09/15/2007,24:00:00
Incorrect time: 499226029,09/15