# Intermediate Regex Homework

## UFO sightings

The [ufo-reports](https://github.com/planetsig/ufo-reports) GitHub repository contains reports of UFO sightings downloaded from the [National UFO Reporting Center](http://www.nuforc.org/) website. One of the data fields is the **duration of the sighting**, which includes **free-form text**. These are some example entries:

- 45 minutes
- 1-2 hrs
- 20 seconds
- 1/2 hour
- about 3 mins
- several minutes
- one hour?
- 5min

Here is **how to read in the file:**

- Use the pandas **`read_csv()`** function to read directly from this [URL](https://raw.githubusercontent.com/planetsig/ufo-reports/master/csv-data/ufo-scrubbed-geocoded-time-standardized.csv).
- Use the **`header=None`** parameter to specify that the data does not have a header row.
- Use the **`nrows=100`** parameter to specify that you only want to read in the first 100 rows.
- Save the relevant Series as a Python list, just like we did in a class exercise.
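
The reading steps above can be sketched as a small helper. This is only an illustration: the function name `load_durations` and the in-memory sample data are hypothetical, and the choice of column `6` for the duration field is an assumption taken from the cell further down in this notebook. The `source` argument can be the URL or a local file path.

```python
import io
import pandas as pd

def load_durations(source, nrows=100, column=6):
    """Read the first `nrows` rows of a header-less CSV and return one column as a list."""
    # header=None: the file has no header row, so columns are numbered 0, 1, 2, ...
    # dtype=str: keep every field as text so the free-form durations are not coerced
    df = pd.read_csv(source, header=None, nrows=nrows, dtype=str)
    return df[column].tolist()

# Tiny made-up stand-in for the real CSV (not actual UFO data)
sample_csv = io.StringIO(
    "a,b,c,d,e,f,45 minutes,h\n"
    "a,b,c,d,e,f,1-2 hrs,h\n"
    "a,b,c,d,e,f,20 seconds,h\n"
)
sample_durations = load_durations(sample_csv, nrows=2)
print(sample_durations)  # ['45 minutes', '1-2 hrs']
```

Only two of the three sample rows are returned because `nrows=2` stops reading early, which is the same mechanism that limits the assignment to 100 rows.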

In [148]:
import pandas as pd
import re

In [149]:
# Reading from the URL raised an error, so the CSV was saved to disk and is read from a local path
path = "../data/ufo.csv"

In [150]:
ufo_df = pd.read_csv(path, header=None, nrows=100)

In [151]:
# Column 6 holds the free-form duration text
list_of_duration = list(ufo_df[6])

Your assignment is to **normalize the duration data for the first 100 rows** by splitting each entry into two parts:

- The first part should be a **number**: either a whole number (such as '45') or a decimal (such as '0.5').
- The second part should be a **unit of time**: one of 'hr', 'min', or 'sec'.

The expected output is a **list of tuples**, each containing the **original (unedited) string**, the **number**, and the **unit of time**. Here is what the output should look like:

> `clean_durations = [('45 minutes', '45', 'min'), ('1-2 hrs', '1', 'hr'), ('20 seconds', '20', 'sec'), ...]`

Here are the **"rules" and guiding principles** for this assignment:

- The normalized duration does not have to be exactly correct, but it must be at least **within the given range**. For example:
    - If the duration is '20-30 min', acceptable answers include '20 min' and '30 min'.
    - If the duration is '1/2 hour', the only acceptable answer is '0.5 hr'.
- When a number is not given, you should make a **"reasonable" substitution for the words**. For example:
    - If the duration is 'several minutes', you can approximate this as '5 min'.
    - If the duration is 'couple minutes', you can approximate this as '2 min'.
- You are not allowed to **skip any entries**. (Your list of tuples should have a length of 100.)
- Try to use **as few substitutions as possible**, and make your regular expression **as simple as possible**.
- Just because you don't get an error doesn't mean that your code was successful. Instead, you should **check each entry by hand** to see if it produced an acceptable result.
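
One way to support the manual check is to list the entries for which no number and unit can be found, rather than letting them fail silently or raise an error. The sketch below is a minimal illustration, not part of the assignment: `find_unmatched` is a hypothetical helper, and the pattern (a number followed somewhere later by 'hr', 'min', or 'sec') is an assumption about how the normalized strings look.

```python
import re

# Assumed pattern: a whole or decimal number, followed somewhere later by a unit
PATTERN = re.compile(r'(\d+\.?\d*)(?=.*(hr|min|sec))')

def find_unmatched(entries, pattern=PATTERN):
    """Return (index, entry) pairs for which the pattern finds no number/unit pair."""
    return [(i, s) for i, s in enumerate(entries) if pattern.search(s) is None]

# Made-up examples: the last two contain no digits, so they cannot match
examples = ['45 min', '1-2 hr', 'unknown', 'a while']
unmatched = find_unmatched(examples)
print(unmatched)  # [(2, 'unknown'), (3, 'a while')]
```

An empty result only shows that every entry matched. It does not show that each match is an acceptable answer, so reading through the output is still needed.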

In [152]:
duration = [re.sub("1/2","0.5",i) for i in list_of_duration]

In [153]:
duration = [re.sub("several","5",i) for i in duration]

In [154]:
# \b keeps "one" from matching inside longer words such as "none" or "someone"
duration = [re.sub(r"\bone\b", "1", i) for i in duration]

In [155]:
duration = [re.sub("couple|few","2",i) for i in duration]

In [156]:
duration = [re.sub("hours|hour|hrs","hr",i) for i in duration]

In [157]:
# Escape the dot so "min." matches a literal period instead of any character
duration = [re.sub(r"minutes|mins|min\.", "min", i) for i in duration]

In [158]:
# Escape the dot so "sec." matches a literal period; redundant alternatives removed
duration = [re.sub(r"seconds|second|secs|sec\.", "sec", i) for i in duration]
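
The chain of substitutions above could also be kept in a single ordered table, which makes it easier to see and count every rule. This is a sketch under assumptions, not the method used in this notebook: `SUBSTITUTIONS` and `normalize` are hypothetical names, and the patterns use optional characters (`s?`) and word boundaries (`\b`) in place of the longer alternations.

```python
import re

# (pattern, replacement) pairs, applied in order; hypothetical table for illustration
SUBSTITUTIONS = [
    (r'\b1/2\b', '0.5'),
    (r'\bseveral\b', '5'),
    (r'\bone\b', '1'),
    (r'\b(?:couple|few)\b', '2'),
    (r'hours?|hrs?', 'hr'),
    (r'minutes?|mins?', 'min'),
    (r'seconds?|secs?', 'sec'),
]

def normalize(text):
    """Apply every substitution in order and return the rewritten string."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text

print(normalize('1/2 hour'))         # 0.5 hr
print(normalize('several minutes'))  # 5 min
print(normalize('one hour?'))        # 1 hr?
print(normalize('couple minutes'))   # 2 min
```

Trailing punctuation such as the period in '5 min.' is left in place here, which is harmless as long as the extraction step only looks for the unit text itself.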

In [159]:
regex = r'(\d+\.?\d*)(?=.*(hr|min|sec))'

In [160]:
re.search(regex,duration[10]).group(2)

'min'

In [161]:
clean_duration =[]

In [162]:
for original, cleaned in zip(list_of_duration, duration):
    match = re.search(regex, cleaned)  # search once and reuse the match object
    clean_duration.append((original, match.group(1), match.group(2)))

In [163]:
clean_duration

[('45 minutes', '45', 'min'),
 ('1-2 hrs', '1', 'hr'),
 ('20 seconds', '20', 'sec'),
 ('1/2 hour', '0.5', 'hr'),
 ('15 minutes', '15', 'min'),
 ('5 minutes', '5', 'min'),
 ('about 3 mins', '3', 'min'),
 ('20 minutes', '20', 'min'),
 ('3  minutes', '3', 'min'),
 ('several minutes', '5', 'min'),
 ('5 min.', '5', 'min'),
 ('3 minutes', '3', 'min'),
 ('30 min.', '30', 'min'),
 ('3 minutes', '3', 'min'),
 ('30 seconds', '30', 'sec'),
 ('20minutes', '20', 'min'),
 ('2 minutes', '2', 'min'),
 ('20-30 min', '20', 'min'),
 ('20 sec.', '20', 'sec'),
 ('45 minutes', '45', 'min'),
 ('20 minutes', '20', 'min'),
 ('one hour?', '1', 'hr'),
 ('5-6 minutes', '5', 'min'),
 ('1 minute', '1', 'min'),
 ('3 seconds', '3', 'sec'),
 ('30 seconds', '30', 'sec'),
 ('approx: 30 seconds', '30', 'sec'),
 ('5min', '5', 'min'),
 ('15 minutes', '15', 'min'),
 ('4.5 or more min.', '4.5', 'min'),
 ('3 minutes', '3', 'min'),
 ('30mins.', '30', 'min'),
 ('3 min', '3', 'min'),
 ('5 minutes', '5', 'min'),
 ('3 to 5 min', '

**Bonus tasks:**

- Try reading in **more than 100 rows**, and see if your code still produces the correct results.
- When a range is specified (such as '1-2 hrs' or '10 to 15 sec'), **calculate the exact midpoint** ('1.5 hr' or '12.5 sec') to use in your normalized data.
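
For the midpoint task, one possible approach is to look for two numbers joined by '-' or 'to' and average them, falling back to the first number when no range is present. The sketch below is an assumption about how this could be done, not a reference solution: `RANGE` and `midpoint_or_first` are hypothetical names, and the input is assumed to contain plain digits rather than number words.

```python
import re

# Two numbers separated by "-" or "to", with optional whitespace around the separator
RANGE = re.compile(r'(\d+\.?\d*)\s*(?:-|to)\s*(\d+\.?\d*)')

def midpoint_or_first(text):
    """Return the midpoint of a range as a string, else the first number, else None."""
    m = RANGE.search(text)
    if m:
        mid = (float(m.group(1)) + float(m.group(2))) / 2
        # Show whole-number midpoints without a trailing ".0"
        return str(int(mid)) if mid.is_integer() else str(mid)
    m = re.search(r'\d+\.?\d*', text)
    return m.group(0) if m else None

print(midpoint_or_first('1-2 hr'))        # 1.5
print(midpoint_or_first('10 to 15 sec'))  # 12.5
print(midpoint_or_first('20-30 min'))     # 25
print(midpoint_or_first('45 min'))        # 45
```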