# Regular Expressions Assignment

## Part 1: Dealing with Noisy Data



We use Pandas to extract information about the largest lakes in the world from the Wikipedia page `https://en.wikipedia.org/wiki/List_of_lakes_by_area`. We want to extract the list of lakes and their areas from this page.

The code below extracts the information from Wikipedia and generates a CSV file, `largest_lakes.csv`, with the information. (You can also find the `largest_lakes.csv` file attached.)

```python
import pandas as pd
# Extract the tables that appear in the HTML page and contain the term "Water volume"
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_lakes_by_area', match = 'Water volume', header=0)
# Get the first table from the list of tables extracted from the HTML page, which is the one that we want
lakes = tables[0]
# Replace the character \xa0 with space
lakes.replace(to_replace = r'\xa0', value= r' ', regex=True, inplace=True)
# Save the Name and Area columns as a CSV file
lakes[['Name', 'Area']].to_csv("largest_lakes.csv", index=False)
```

Now open the file and read its contents into memory as the `lines` list of strings (one entry per line).

In [44]:

f = open("largest_lakes.csv","r")
lines = f.read().splitlines()
f.close()

If you take a look at the extracted information, though, you see that it is a bit messy. The names of the lakes have leftovers from footnotes, and the area column contains extra characters that we do not need.

In [45]:
lines[1:10]

['Caspian Sea*,"2 (143,000 sq mi)"',
 'Superior[n 1],"2 (31,700 sq mi)[14]"',
 'Victoria,"2 (26,590 sq mi)"',
 'Huron[n 1],"2 (23,000 sq mi)[14]"',
 'Michigan[n 1],"2 (22,000 sq mi)[14]"',
 'Tanganyika,"2 (12,600 sq mi)"',
 'Baikal,"2 (12,200 sq mi)"',
 'Great Bear Lake,"2 (12,000 sq mi)"',
 'Malawi,"2 (11,400 sq mi)"']

Our goal is to use regular expressions to process the file and create a clean file with the name of the lake, and the area of the lake (in square miles) in the next column. The area column should be an integer (without commas). For example, the first nine data lines (listed above) should be transformed into:

```
Caspian Sea	143000
Superior	31700
Victoria	26590
Huron	23000
Michigan	22000
Tanganyika	12600
Baikal	12200
Great Bear Lake	12000
Malawi	11400
```

In [46]:
# Your code here (It should not be more than 10-15 lines, at most)
import re

# Lake name: from the start of the line up to a footnote marker (* or [) or the comma
lakereg = re.compile(r'^[^,*\[]+')
# Area in square miles: the text inside the parentheses, e.g. "143,000 sq mi"
sizereg = re.compile(r'\(([^\)]+)\)')
# First run of digits, applied after the thousands separators are removed
digitsreg = re.compile(r'[0-9]+')

st = ""
for line in lines[1:10]:
    lake = lakereg.findall(line)[0].strip()
    area = sizereg.findall(line)[0].replace(',', '')
    size = digitsreg.findall(area)[0]
    st += lake + '\t' + size + '\n'

print(st)

Caspian Sea	143000
Superior	31700
Victoria	26590
Huron	23000
Michigan	22000
Tanganyika	12600
Baikal	12200
Great Bear Lake	12000
Malawi	11400
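
The same cleanup can also be packaged as one pattern with named groups. The sketch below is one possible variant, not the required solution: the helper name `clean_lake_line` is hypothetical, and it assumes every line has the shape shown in the sample above (a name, optional footnote markers, then an area field with the square-mile figure in parentheses).

```python
import re

# Hypothetical pattern, assuming lines shaped like: Superior[n 1],"2 (31,700 sq mi)[14]"
LAKE_LINE = re.compile(
    r'^(?P<name>[^,*\[]+)'          # lake name, up to a footnote marker (* or [) or the comma
    r'.*?'                          # footnote leftovers and the start of the area field
    r'\((?P<sqmi>[\d,]+) sq mi\)'   # the square-mile figure inside the parentheses
)

def clean_lake_line(line):
    """Return 'name<TAB>area' for one CSV line, or None if the line does not match."""
    m = LAKE_LINE.search(line)
    if m is None:
        return None
    name = m.group('name').strip()
    area = int(m.group('sqmi').replace(',', ''))  # drop thousands separators
    return f'{name}\t{area}'

print(clean_lake_line('Superior[n 1],"2 (31,700 sq mi)[14]"'))
```

Returning `None` for lines that do not match (such as the header row) keeps the caller in charge of deciding whether to skip or report them.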



## Part 2: Reformatting a Data File

You are given a file `roster-f2018.txt` that contains the roster of students enrolled in the class.

The file contains three **tab-separated** columns: `Section`, `Name`, `Email`. 

* The `Section` can be either `S1` or `S2`.
* The `Name` column has the format [_lastname, firstname middlename_]. Not all students have a middle name listed.
* The `Email` column is the NYU email of the student.

In [47]:
f = open("roster-f2018.txt")
lines = f.read().splitlines()
f.close()

In [48]:
# Last 10 lines of the file
lines[-10:]

['S2\tThandla,Rajiv\trt1645@nyu.edu',
 'S2\tTsoi,Ashley Shengqiao\tast418@nyu.edu',
 'S2\tTurdaliev,Komiljon\tkt1673@nyu.edu',
 'S2\tVerma,Sunny\tsv1444@nyu.edu',
 'S2\tXu,Sally Jing\tsjx203@nyu.edu',
 'S2\tYang,Simon\tsy1924@nyu.edu',
 'S2\tYao,Karen H\tkhy236@nyu.edu',
 'S2\tYoon,Paul J\tpjy226@nyu.edu',
 'S2\tZheng,Kaitlyn H\tkhz216@nyu.edu',
 'S2\tde Valk,Daniel\tddv228@nyu.edu']

You are asked to reformat the file. The reformatted file should be tab-separated, and should include five columns:

`{section}\t{email}\t{first}\t{middle}\t{last}`

The requirements:
* The first column should be `Section`, but instead of the values `S1` and `S2`, it should say `Section 1` and `Section 2`, respectively.
* The second column should be the `NetId`. For someone with the email `pi1@nyu.edu`, the NetId is `pi1`.
* The third column should be the first name of the student.
* The fourth column should be the middle name of the student (it should be empty when there is no middle name).
* The fifth column should be the last name of the student.

You can see an [example of the reformatted file](https://docs.google.com/spreadsheets/d/10j33VgMU6Kjf1MIUnNWpEKXLjLCNXwxjVkMwDU74n08/edit?usp=sharing).
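
As a rough illustration of the requirements above, the sketch below reformats a single roster line with one regular expression. The names `ROSTER_LINE` and `reformat_roster_line` are hypothetical, and the pattern assumes every line looks like the samples shown earlier: `S1` or `S2`, a `lastname,firstname` pair with an optional middle name after a space, and an `@nyu.edu` address.

```python
import re

# Hypothetical pattern for one roster line: section, "last,first [middle]", email
ROSTER_LINE = re.compile(
    r'^S(?P<section>[12])\t'
    r'(?P<last>[^,\t]+),(?P<first>[^ \t]+)(?: (?P<middle>[^\t]+))?\t'
    r'(?P<netid>[^@\t]+)@nyu\.edu$'
)

def reformat_roster_line(line):
    """Return 'section<TAB>netid<TAB>first<TAB>middle<TAB>last', or None if no match."""
    m = ROSTER_LINE.match(line)
    if m is None:
        return None
    middle = m.group('middle') or ''  # empty column when no middle name is listed
    return '\t'.join([
        f"Section {m.group('section')}",
        m.group('netid'),
        m.group('first'),
        middle,
        m.group('last'),
    ])

print(repr(reformat_roster_line('S2\tTsoi,Ashley Shengqiao\tast418@nyu.edu')))
print(repr(reformat_roster_line('S2\tde Valk,Daniel\tddv228@nyu.edu')))
```

Splitting the name on the comma first means a last name that contains a space, such as `de Valk`, stays in one piece.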

In [50]:
# Your code here (It should not be more than 10-15 lines, at most)
import re

# NetId: the part of the email address before the @
netregex = re.compile(r'.+?(?=@)')
# Name: "lastname,firstname" with an optional middle name after a space
nameregex = re.compile(r'^([^,]+),\s*(\S+)(?:\s+(.+))?$')

st1 = ''
for cnt, line in enumerate(lines[1:]):
    if cnt == 0:
        # The first iteration emits the header row
        st1 += 'Section\tNetId\tFirst\tMiddle\tLast\n'
        continue
    section, name, email = line.split('\t')
    section = re.sub(r'^S([12])$', r'Section \1', section)
    netid = netregex.findall(email)[0]
    # The Name column lists the last name first; a missing middle name becomes ''
    last, first, middle = nameregex.match(name).groups(default='')
    st1 += '\t'.join([section, netid, first, middle, last]) + '\n'
print(st1)

Section	NetId	First	Middle	Last
Section 1	kcc407	Kent		Cai
Section 1	jmc1250	Jason		Chabora
Section 1	jc7015	Jonah		Chang
Section 1	ac6325	Amy		Chen
Section 1	vc1238	Vlad		Cherevkov
Section 1	tff234	Teymour		Farman-Farmaian
Section 1	xf365	Judy		Fu
Section 1	ggl245	Gabriel		Garcia
Section 1	wh916	Wangrui		Hou
Section 1	jli232	Jess		Ingraham
Section 1	ak5562	Aditya		Khosla
Section 1	ak5635	Arkin		Khosla
Section 1	pk1676	Pratyush		Kundu
Section 1	al4533	Akshat		Lakhotia
Section 1	gl1144	GP		Lee
Section 1	al5361	Allen		Lin
Section 1	jl7028	Jonathan		Lin
Section 1	yl4042	Yuheng		Ling
Section 1	col223	Cristina	Orayani	Llacer
Section 1	rl2838	Rahul		Loney
Section 1	rm4467	Riku		Maeda
Section 1	cn1095	Christie		Nakajima
Section 1	nn1079	Nicole		Ng
Section 1	cro257	Chris		Osufsen
Section 1	krp354	Kenton		Palmer
Section 1	tr1328	Teona		Ristova
Section 1	rr2875

## Part 3: Detecting Problematic Data Entries

We are going to process the official NYPD Complaints dataset (available from NYC Open Data). The dataset contains all the complaints to NYPD that were reported from 2006 until today (the `RPT_DT` column contains the date when the incident was reported).

The code below fetches the latest version of the dataset from NYC Open Data, and creates a smaller file with just 4 columns: CMPLNT_NUM (the complaint number), RPT_DT (the date the incident was reported), CMPLNT_FR_DT (the date the incident occurred), and CMPLNT_FR_TM (the time the incident occurred).

```python
import pandas as pd
# From https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i/data
!curl 'https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD' -o nypd.csv
df = pd.read_csv('nypd.csv')
df[['CMPLNT_NUM','RPT_DT','CMPLNT_FR_DT','CMPLNT_FR_TM']].dropna().to_csv("nypd_short.csv.gz", index=False, compression='gzip')
del df
```

The code below loads the shortened file in memory (in the `lines` variable). You can take a look at the first few lines below.

In [1]:
import gzip
with gzip.open('nypd_short.csv.gz', 'rt') as f:
    lines = f.read().splitlines()

In [2]:
lines[:10]

['CMPLNT_NUM,RPT_DT,CMPLNT_FR_DT,CMPLNT_FR_TM',
 '101109527,12/31/2015,12/31/2015,23:45:00',
 '153401121,12/31/2015,12/31/2015,23:36:00',
 '569369778,12/31/2015,12/31/2015,23:30:00',
 '968417082,12/31/2015,12/31/2015,23:30:00',
 '641637920,12/31/2015,12/31/2015,23:25:00',
 '365661343,12/31/2015,12/31/2015,23:18:00',
 '608231454,12/31/2015,12/31/2015,23:15:00',
 '265023856,12/31/2015,12/31/2015,23:15:00',
 '989238731,12/31/2015,12/31/2015,23:15:00']

Our focus for this assignment will be the columns `CMPLNT_FR_DT` and `CMPLNT_FR_TM`, which record the date and time at which the crime **occurred**. (Note that the date an incident was reported and the date it occurred are not necessarily the same; sometimes it takes years for an incident to be reported.) The date is recorded in the MM/DD/YYYY format, and the time is recorded as a 24-hr time (00:00 to 23:59).

Unfortunately, the dataset seems to include some dates in the `CMPLNT_FR_DT` column that are incorrect, and some times in the `CMPLNT_FR_TM` that are incorrect. Your task is to write code that uses regular expressions to detect these entries, and print them out. 

* You should check the `CMPLNT_FR_DT` column for correctness. In general, any date whose year is not 19xx or 20xx should be marked as **definitely incorrect**. Dates before 1930 (i.e., almost 90 years ago!) should also be treated as **likely incorrect**.
* You should also check the `CMPLNT_FR_TM` column and detect any times that are not following the 24-hr time format (00:00 to 23:59).
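
One way to picture the two levels of date checks is a small classifier that is tried on a few values before being run over the whole file. This is only a sketch under stated assumptions: the function name `classify_date` and its three labels are hypothetical, and any value that does not have the MM/DD/YYYY shape at all is also counted as definitely incorrect.

```python
import re

# MM/DD/YYYY with the four-digit year captured
DATE = re.compile(r'^\d{2}/\d{2}/(\d{4})$')

def classify_date(value):
    """Label a CMPLNT_FR_DT value as 'ok', 'likely incorrect' or 'definitely incorrect'."""
    m = DATE.match(value)
    # Assumption: a malformed value is treated like a year outside 19xx/20xx
    if m is None or not re.match(r'(19|20)', m.group(1)):
        return 'definitely incorrect'
    if int(m.group(1)) < 1930:
        return 'likely incorrect'
    return 'ok'

# Made-up sample values for illustration
for value in ['12/31/2015', '01/01/1900', '05/12/1015']:
    print(value, '->', classify_date(value))
```

Checking the century first means the "before 1930" test only ever sees years in the 1900s or 2000s.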

In [21]:
# Detect incorrect dates and print out the incorrect lines
import re

# CMPLNT_FR_DT is expected as MM/DD/YYYY; capture the four-digit year
datereg = re.compile(r'^\d{2}/\d{2}/(\d{4})$')
# Years that start with 19 or 20
centuryreg = re.compile(r'^(19|20)')

for line in lines[1:]:
    dt = line.split(",")[2]  # CMPLNT_FR_DT column
    m = datereg.match(dt)
    # Definitely incorrect: malformed date, or a year that is not 19xx or 20xx
    if m is None or not centuryreg.match(m.group(1)):
        print(line)
    # Likely incorrect: a year before 1930
    elif int(m.group(1)) < 1930:
        print(line)

In [42]:
# Detect incorrect times and print out the incorrect lines
import re

# 24-hr time from 00:00 to 23:59, with optional seconds (the file stores HH:MM:SS)
timereg = re.compile(r'^([01][0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9])?$')

for line in lines[1:]:
    tm = line.split(",")[3]  # CMPLNT_FR_TM column
    if not timereg.match(tm):
        print(line)
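
When many lines fail the check, a count per distinct value can be easier to read than a long printout. The sketch below is one possible way to do that; `summarize_bad_times` is a hypothetical helper, and two of the sample rows are made up purely for illustration.

```python
import re
from collections import Counter

# 24-hr time from 00:00 to 23:59, with optional seconds
VALID_TIME = re.compile(r'([01][0-9]|2[0-3]):[0-5][0-9](:[0-5][0-9])?')

def summarize_bad_times(rows):
    """Count each distinct CMPLNT_FR_TM value that fails the 24-hr pattern."""
    bad = Counter()
    for row in rows:
        tm = row.split(',')[3]  # CMPLNT_FR_TM column
        if not VALID_TIME.fullmatch(tm):
            bad[tm] += 1
    return bad

sample = [
    '101109527,12/31/2015,12/31/2015,23:45:00',
    '000000001,01/01/2015,01/01/2015,24:00:00',  # made-up row for illustration
    '000000002,01/01/2015,01/01/2015,24:00:00',  # made-up row for illustration
]
print(summarize_bad_times(sample))
```

Using `fullmatch` requires the whole field to fit the pattern, so trailing junk after a valid-looking time is also counted as a problem.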
    
    