# INFO 212: Data Science Programming 1
___

### Week 7: Data Cleaning and Preparation
___

### Mon., May 14, and Wed., May 16, 2018
---

**Question:**
- What capabilities does Python provide to clean, transform, and re-arrange data? 

**Objectives:**
- Filter out and fill in missing values
- Remove duplicates
- Transform data, replace values, and rename index
- Discretize and bin data
- Resample data 
- Compute indicators
- Manipulate strings and use regular expressions

In [3]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
%matplotlib inline

```
!cat examples/ex5.csv```

In [1]:
!cat examples/examples/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [3]:
pd.read_csv("examples/examples/ex5.csv", header = None, index_col = 0).isnull()

0,1,2,3,4,5
something,False,False,False,False,False
one,False,False,False,False,True
two,False,False,True,False,False
three,False,False,False,False,False


## Handling Missing Data

Missing data occurs commonly in many data analysis applications.
All of the descriptive statistics on pandas objects exclude missing data by default. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected.
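
As a small illustration of the default behavior described above, the following sketch (using a made-up Series named `readings`, not data from this notebook) compares a mean that skips NaN with one that does not:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value (illustrative only)
readings = pd.Series([1.0, np.nan, 3.0])

mean_default = readings.mean()              # NaN is skipped: (1.0 + 3.0) / 2
mean_strict = readings.mean(skipna=False)   # NaN propagates into the result
n_observed = readings.count()               # counts only the non-missing values

print(mean_default, mean_strict, n_observed)
```

Passing `skipna=False` is one way to make missing values visible in a summary instead of silently ignoring them.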

```
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull()```

In [7]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
string_data.isnull().sum()

1

```
string_data[0] = None
string_data.isnull()```

In [5]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

### Filtering Out Missing Data
In statistics applications, NA data may either be data that does not exist or that exists but was not observed
(through problems with data collection, for example). When cleaning up data for
analysis, it is often important to do analysis on the missing data itself to identify data
collection problems or potential biases in the data caused by missing data.

**dropna()**  Filter axis labels based on whether values for each label have missing data. For Series, it returns non-null data.

```
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()```

In [8]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

```
data[data.notnull()]```

In [11]:
data[data.notnull()]


0    1.0
2    3.5
4    7.0
dtype: float64

**dropna()** With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA or only those containing any NAs. dropna by default drops
any row containing a missing value.

```
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
cleaned```

In [16]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' will only drop rows that are all NA:

```
data.dropna(how='all')```

In [17]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


```
data[4] = NA
data
data.dropna(axis=1, how='all')```

In [20]:
data[4] = NA
data
data.dropna(axis=1, how='all')


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


A related way to filter out DataFrame rows tends to concern time series data. Suppose
you want to keep only rows containing a certain number of observations. You can
indicate this with the thresh argument.

```
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
df.dropna()
df.dropna(thresh=2)```

In [76]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df.iloc[:5,0] = NA
df
df.dropna(thresh=2)

Unnamed: 0,0,1,2
0,,,
1,,,
2,,,1.375236
3,,,-2.868018
4,,1.287155,-0.574306
5,0.495327,0.39605,0.588798
6,-1.281757,2.029923,-0.501945
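
The meaning of thresh is easier to see on a tiny table. The sketch below assumes a hypothetical three-row DataFrame named `frame`; it shows that `thresh=2` keeps the rows that have at least two non-NA values, which is the same as filtering on a per-row count of observations:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 observations, row 1 has 1, row 2 has 0
frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, np.nan, np.nan],
})

non_na_per_row = frame.notnull().sum(axis=1)   # number of observed values per row
kept = frame.dropna(thresh=2)                  # keep rows with >= 2 non-NA values
same = frame[non_na_per_row >= 2]              # equivalent boolean filter

print(kept)
```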


### Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. For most purposes,
the fillna method is the workhorse function to use. Calling fillna with a
constant replaces missing values with that value:

```
df.fillna(0)```

In [28]:
df.fillna({1:0.5,2:0})

Unnamed: 0,0,1,2
0,0.107657,0.5,0.0
1,-0.017007,0.5,0.0
2,1.634736,0.5,0.45794
3,0.555154,0.5,-0.440554
4,-0.30135,0.498791,-0.823991
5,1.320566,0.507965,-0.653438
6,0.18698,-0.391725,-0.272293


By calling fillna with a dict, you can use a different fill value for each column:

```
df.fillna({1: 0.5, 2: 0})```

fillna returns a new object, but you can modify the existing object in-place:

```
_ = df.fillna(0, inplace=True)
df```

In [29]:
_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,0.107657,0.0,0.0
1,-0.017007,0.0,0.0
2,1.634736,0.0,0.45794
3,0.555154,0.0,-0.440554
4,-0.30135,0.498791,-0.823991
5,1.320566,0.507965,-0.653438
6,0.18698,-0.391725,-0.272293


The same interpolation methods available for reindexing can be used with fillna:

```
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2)```

In [36]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
df.fillna(method='ffill')
df.fillna(method='ffill', limit=2,axis =1)

Unnamed: 0,0,1,2
0,-1.211411,-0.258867,-0.581647
1,-1.260421,0.464575,-1.070241
2,0.804223,0.804223,2.01039
3,-0.887104,-0.887104,-0.267217
4,0.483338,0.483338,0.483338
5,0.399594,0.399594,0.399594


With fillna you can do lots of other things with a little creativity. For example, you
might pass the mean or median value of a Series:

```
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())```

In [37]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
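
The same idea extends to a DataFrame. As a sketch, assuming a small made-up table named `scores`, passing the Series of column means to fillna fills each column with its own mean:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value per column
scores = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [np.nan, 4.0, 8.0]})

column_means = scores.mean()          # x -> 2.0, y -> 6.0 (NaN skipped)
filled = scores.fillna(column_means)  # each column is filled with its own mean

print(filled)
```

Whether the mean is a sensible fill value depends on why the data is missing, so treat this as one option rather than a default.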

### Exercise: cleaning missing data in the following datasets
A dataset of events that occurred in American football games:
[Detailed NFL Play-by-Play Data 2009-2017](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016/data)

A dataset of building permits issued in San Francisco:
[San Francisco Building Permits](https://www.kaggle.com/aparnashastry/building-permit-applications-data/data)

Here's what we're going to do:

- Take a first look at the data
- See how many missing data points we have
- Figure out why the data is missing
- Drop missing values
- Fill in missing values
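
The steps above can be sketched on a tiny stand-in table before running them on the real files. The DataFrame `plays` and its columns are hypothetical, chosen only to resemble play-by-play data. The third step (working out why values are missing) is a judgment call about the data and is not something the code can do for you.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a play-by-play table (not the Kaggle data)
plays = pd.DataFrame({
    'down': [np.nan, 1.0, 2.0, 3.0],
    'time': ['15:00', None, '14:16', '13:35'],
    'yards': [0, 5, np.nan, 7],
})

print(plays.head())                                   # first look at the data

missing_per_column = plays.isnull().sum()             # missing count per column
missing_fraction = plays.isnull().values.mean()       # share of all cells missing

complete_rows = plays.dropna()                        # drop rows with any NA
filled = plays.fillna({'time': '15:00', 'yards': 0})  # fill selected columns

print(missing_per_column, missing_fraction)
```

Here only `time` and `yards` are filled. `down` is left alone because a fill value for it would need domain knowledge.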


In [40]:
nfl = pd.read_csv("datasets/NFL-play1.csv")
nfl.head()



Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


In [41]:
nfl.shape

(407688, 102)

In [46]:
column_null = nfl.isnull().sum()
column_null.sum() / (nfl.shape[0]*nfl.shape[1])

0.2487214126835169

In [70]:
drop_na_rows = nfl.dropna(thresh=88)
drop_na_rows.shape

(4, 102)

In [85]:
nfl['time'].fillna('15:00')

0         15:00
1         14:53
2         14:16
3         13:35
4         13:27
5         13:16
6         12:40
7         12:11
8         11:34
9         11:24
          ...  
407678    00:53
407679    00:44
407680    00:44
407681    00:38
407682    00:32
407683    00:28
407684    00:28
407685    00:24
407686    00:14
407687    00:00
Name: time, Length: 407688, dtype: object

In [82]:
nfl['time'].fillna('15:00').isnull().sum()

0

In [84]:
nfl

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009
5,2009-09-10,2009091000,2,1,1.0,13:16,14,3496.0,11.0,TEN,...,,0.558929,0.441071,0.578453,0.421547,0.441071,-0.019524,,,2009
6,2009-09-10,2009091000,2,1,2.0,12:40,13,3460.0,36.0,TEN,...,0.163935,0.578453,0.421547,0.582881,0.417119,0.421547,-0.004427,-0.010456,0.006029,2009
7,2009-09-10,2009091000,2,1,3.0,12:11,13,3431.0,29.0,TEN,...,,0.582881,0.417119,0.617544,0.382456,0.417119,-0.034663,,,2009
8,2009-09-10,2009091000,2,1,4.0,11:34,12,3394.0,37.0,TEN,...,,0.617544,0.382456,0.591489,0.408511,0.382456,0.026054,,,2009
9,2009-09-10,2009091000,3,1,1.0,11:24,12,3384.0,10.0,TEN,...,0.541602,0.591489,0.408511,0.585405,0.414595,0.591489,-0.006084,-0.024526,0.018442,2009


## Data Transformation

### Removing Duplicates

```
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data```

The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:

```
data.duplicated()```

Relatedly, drop_duplicates returns a DataFrame containing only the rows for which
duplicated is False:

```
data.drop_duplicates()```

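Both methods consider all of the columns by default. You can instead specify a subset
of columns to detect duplicates. Suppose we add a column of values and want to filter
duplicates based only on the 'k1' column: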

```
data['v1'] = range(7)
data.drop_duplicates(['k1'])```

duplicated and drop_duplicates by default keep the first observed value combination.
Passing keep='last' will return the last one:

```
data.drop_duplicates(['k1', 'k2'], keep='last')```
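
Before dropping anything, it can help to look at every copy of a duplicated row. As a sketch, the keep parameter of duplicated also accepts False, which flags all members of a duplicate group and not only the later ones. The frame below rebuilds the two-column example from this section:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

later_copies = data.duplicated()              # default keep='first'
all_copies = data[data.duplicated(keep=False)]  # every row that has a twin

print(all_copies)
```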

### Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame. Consider the following hypothetical
data collected about various kinds of meat:

```
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data```

Suppose you wanted to add a column indicating the type of animal that each food
came from. Let’s write down a mapping of each distinct meat type to the kind of
animal:

```
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}```

The map method on a Series accepts a function or dict-like object containing a mapping,
but here we have a small problem in that some of the meats are capitalized and
others are not. Thus, we need to convert each value to lowercase using the str.lower
Series method:

```
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data```

We could also have passed a function that does all the work:

```
data['food'].map(lambda x: meat_to_animal[x.lower()])```

Using map is a convenient way to perform element-wise transformations and other
data cleaning–related operations.
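
One difference between the two approaches is how they handle a value that is missing from the mapping. The sketch below uses a shortened, hypothetical mapping and an extra food ('tofu') that is not in it. Mapping with the dict yields NaN for the unknown value, while a lambda that indexes the dict directly would raise a KeyError, so the lambda here uses dict.get with a fallback:

```python
import pandas as pd

# Shortened mapping and a value ('tofu') that it does not cover
foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Dict mapping: values without a key become NaN
mapped = foods.str.lower().map(meat_to_animal)

# Function mapping: dict.get supplies a fallback instead of raising KeyError
safe = foods.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

print(mapped)
print(safe)
```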

### Replacing Values
Filling in missing data with the fillna method is a special case of more general value
replacement. As you’ve already seen, map can be used to modify a subset of values in
an object but replace provides a simpler and more flexible way to do so. Let’s consider
this Series:

```
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data```

The -999 values might be sentinel values for missing data. To replace these with NA
values that pandas understands, we can use replace, producing a new Series (unless
you pass inplace=True):

```
data.replace(-999, np.nan)```

If you want to replace multiple values at once, you instead pass a list and then the
substitute value:

```
data.replace([-999, -1000], np.nan)```

To use a different replacement for each value, pass a list of substitutes:

```
data.replace([-999, -1000], [np.nan, 0])```

The argument passed can also be a dict:

```
data.replace({-999: np.nan, -1000: 0})```

### Renaming Axis Indexes
Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. You can also modify
the axes in-place without creating a new data structure. Here’s a simple example:

```
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])```

Like a Series, the axis indexes have a map method:

```
transform = lambda x: x[:4].upper()
data.index.map(transform)```

You can assign to index, modifying the DataFrame in-place:

```
data.index = data.index.map(transform)
data```

If you want to create a transformed version of a dataset without modifying the original,
a useful method is rename:

```
data.rename(index=str.title, columns=str.upper)```

Notably, rename can be used in conjunction with a dict-like object providing new values
for a subset of the axis labels:

```
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})```

rename saves you from the chore of copying the DataFrame manually and assigning
to its index and columns attributes. Should you wish to modify a dataset in-place,
pass inplace=True:

```
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data```

### Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

```
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]```

In [1]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To
do so, you have to use cut, a function in pandas:

```
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats```

In [4]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

The object pandas returns is a special Categorical object. The output you see
describes the bins computed by pandas.cut. You can treat it like an array of strings
indicating the bin name; internally it contains a categories array specifying the distinct
category names along with a labeling for the ages data in the codes attribute:

```
cats.codes
cats.categories
pd.value_counts(cats)```

In [6]:
cats.codes
cats.categories


IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
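
As a sketch of how the codes and the bin counts relate, the following rebuilds the ages example and counts the bins with the Categorical's own value_counts method. Using that method instead of the top-level function is a choice made for this illustration:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

codes = cats.codes.tolist()     # position of each age's bin in cats.categories
counts = cats.value_counts()    # one count per bin, listed in bin order

print(codes)
print(counts)
```

Age 25 lands in the first bin because the intervals are closed on the right by default.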

Consistent with mathematical notation for intervals, a parenthesis means that the side
is open, while the square bracket means it is closed (inclusive). You can change which
side is closed by passing right=False:

```
pd.cut(ages, [18, 26, 36, 61, 100], right=False)```

In [7]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

You can also pass your own bin names by passing a list or array to the labels option:

```
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)```

In [8]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute
equal-length bins based on the minimum and maximum values in the data.
Consider the case of some uniformly distributed data chopped into fourths:

```
data = np.random.rand(20)
pd.cut(data, 4, precision=2)```

In [9]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.73, 0.96], (0.25, 0.49], (0.0074, 0.25], (0.0074, 0.25], (0.49, 0.73], ..., (0.49, 0.73], (0.73, 0.96], (0.73, 0.96], (0.73, 0.96], (0.49, 0.73]]
Length: 20
Categories (4, interval[float64]): [(0.0074, 0.25] < (0.25, 0.49] < (0.49, 0.73] < (0.73, 0.96]]

The precision=2 option limits the decimal precision to two digits.

A closely related function, qcut, bins the data based on sample quantiles. Depending
on the distribution of the data, using cut will not usually result in each bin having the
same number of data points. Since qcut uses sample quantiles instead, by definition
you will obtain roughly equal-size bins:

```
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4)  # Cut into quartiles
cats
pd.value_counts(cats)```

In [10]:
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4)  # Cut into quartiles
cats
pd.value_counts(cats)

(0.626, 3.928]                   250
(-0.0171, 0.626]                 250
(-0.691, -0.0171]                250
(-2.9499999999999997, -0.691]    250
dtype: int64

Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

```
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])```
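
To see what custom quantiles do to the bin sizes, the sketch below draws 1000 values with NumPy's default_rng (a different generator and seed from the one used elsewhere in this notebook, so the values will not match) and counts the members of each bin:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seed chosen arbitrarily for this sketch
data = rng.standard_normal(1000)

# Edges at the 10th, 50th, and 90th percentiles
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0])
counts = cats.value_counts()        # counts listed in bin order

print(counts)
```

The bin sizes follow the spacing of the quantiles, 10%, 40%, 40%, and 10% of the data, and not the width of the intervals.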

## String Manipulation
Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s built-in methods. For more complex pattern matching and text
manipulations, regular expressions may be needed. pandas adds to the mix by enabling
you to apply string and regular expressions concisely on whole arrays of data,
additionally handling the annoyance of missing data.

### String Object Methods
In many string munging and scripting applications, built-in string methods are sufficient.
As an example, a comma-separated string can be broken into pieces with
split:

```
val = 'a,b,  guido'
val.split(',')```

split is often combined with strip to trim whitespace (including line breaks):

```
pieces = [x.strip() for x in val.split(',')]
pieces```

These substrings could be concatenated together with a two-colon delimiter using
addition:

```
first, second, third = pieces
first + '::' + second + '::' + third```

But this isn’t a practical generic method. A faster and more Pythonic way is to pass a
list or tuple to the join method on the string '::':

```
'::'.join(pieces)```

Other methods are concerned with locating substrings. Using Python’s in keyword is
the best way to detect a substring, though index and find can also be used:

```
'guido' in val
val.index(',')
val.find(':')```

Note the difference between find and index is that index raises an exception if the
string isn’t found (versus returning –1):

```
val.index(':')```
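
Because index raises an exception, code that uses it usually needs a try/except block. A minimal sketch with the same `val` string:

```python
val = 'a,b,  guido'

pos_find = val.find(':')        # find signals "not found" with -1

try:
    pos_index = val.index(':')  # index raises ValueError when not found
except ValueError:
    pos_index = None

print(pos_find, pos_index)
```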

Relatedly, count returns the number of occurrences of a particular substring:

```
val.count(',')```

replace will substitute occurrences of one pattern for another. It is commonly used
to delete patterns, too, by passing an empty string:

```
val.replace(',', '::')
val.replace(',', '')```

### Regular Expressions
Regular expressions provide a flexible way to search or match (often more complex)
string patterns in text. A single expression, commonly called a regex, is a string
formed according to the regular expression language. Python’s built-in re module is
responsible for applying regular expressions to strings.

The re module functions fall into three categories: pattern matching, substitution,
and splitting. Naturally these are all related; a regex describes a pattern to locate in the
text, which can then be used for many purposes. 
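
As a quick sketch of the three categories, here is one call from each, applied to a made-up string named `line`:

```python
import re

line = 'alpha  beta\tgamma'   # hypothetical sample text

matches = re.findall(r'\w+', line)     # pattern matching: all runs of word characters
replaced = re.sub(r'\s+', ' ', line)   # substitution: collapse whitespace to one space
parts = re.split(r'\s+', line)         # splitting: break the string on whitespace

print(matches, replaced, parts)
```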

Suppose we wanted to split a string with a variable number of whitespace characters
(tabs, spaces, and newlines). The regex describing one or more whitespace characters
is \s+:

```
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)```

In [1]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)



['foo', 'bar', 'baz', 'qux']

When you call re.split(r'\s+', text), the regular expression is first compiled, and
then its split method is called on the passed text. You can compile the regex yourself
with re.compile, forming a reusable regex object:

```
regex = re.compile(r'\s+')
regex.split(text)```

If, instead, you wanted to get a list of all patterns matching the regex, you can use the
findall method:

```
regex.findall(text)```

match and search are closely related to findall. While findall returns all matches
in a string, search returns only the first match. More rigidly, match only matches at
the beginning of the string. As a less trivial example, let’s consider a block of text and
a regular expression capable of identifying most email addresses:

```
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)```

Using findall on the text produces a list of the email addresses:

```
regex.findall(text)```

search returns a special match object for the first email address in the text. For the
preceding regex, the match object can only tell us the start and end position of the
pattern in the string:

```
m = regex.search(text)
m
text[m.start():m.end()]```

regex.match returns None, as it only will match if the pattern occurs at the start of the
string:

```
print(regex.match(text))```

Relatedly, sub will return a new string with occurrences of the pattern replaced by
a new string:

```
print(regex.sub('REDACTED', text))```

Suppose you wanted to find email addresses and simultaneously segment each
address into its three components: username, domain name, and domain suffix. To
do this, put parentheses around the parts of the pattern to segment:

```
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)```

A match object produced by this modified regex returns a tuple of the pattern components
with its groups method:

```
m = regex.match('wesm@bright.net')
m.groups()```

findall returns a list of tuples when the pattern has groups:

```
regex.findall(text)```

sub also has access to groups in each match using special symbols like \1 and \2. The
symbol \1 corresponds to the first matched group, \2 corresponds to the second, and
so forth:

```
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))```
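
Numbered groups become hard to read as patterns grow. One option, not used in the examples above, is to name the groups with the `(?P<name>...)` syntax from Python's re module. The group names below are chosen for this sketch:

```python
import re

# Same email pattern as before, with each group given a name
pattern = (r'(?P<username>[A-Z0-9._%+-]+)@'
           r'(?P<domain>[A-Z0-9.-]+)\.'
           r'(?P<suffix>[A-Z]{2,4})')
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesm@bright.net')
parts = m.groupdict()                     # dict keyed by group name

# Named groups can be referenced in sub with \g<name>
rewritten = regex.sub(r'\g<username> at \g<domain>', 'wesm@bright.net')

print(parts, rewritten)
```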

### Exercise
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.
Let’s look at the MovieLens 1M dataset.

```
mnames = ['movie_id', 'title', 'genres']
# A multi-character separator needs the Python parsing engine
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')
movies[:10]
```

Adding indicator variables for each genre requires a little bit of wrangling. First, we
extract the list of unique genres in the dataset:
```
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)

genres
```

One way to construct the indicator DataFrame is to start with a DataFrame of all
zeros:
```
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
```
Now, iterate through each movie and set entries in each row of dummies to 1. To do
this, we use the dummies.columns to compute the column indices for each genre:
```
gen = movies.genres[0]
gen.split('|')

dummies.columns.get_indexer(gen.split('|'))
```
Then, we can use .iloc to set values based on these indices:
```
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
```
Then, as before, you can combine this with movies:
```
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
```
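
For comparison, pandas also offers a shortcut for this kind of delimited membership column: the `str.get_dummies` method on a Series. The sketch below is an alternative to the loop above, not the method the exercise walks through, and it uses a tiny hypothetical `movies` table in place of the MovieLens file:

```python
import pandas as pd

# Hypothetical stand-in for the movies table (titles and genres are made up)
movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Comedy|Drama', 'Drama', 'Action|Comedy'],
})

# Split each genres string on '|' and build one 0/1 column per genre
dummies = movies['genres'].str.get_dummies(sep='|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))

print(movies_windic)
```

Writing the loop by hand is still useful for understanding what the indicator matrix contains.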
