<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 1


---


### Lesson Guide
- [The Basics of `pandas` DataFrames](#basics)
    - [Loading Data](#loading)
    - [A Basic Examination of DataFrames](#examine)
    - [Selecting Columns](#selecting)
    - [Describing Data](#describing)
- [Exercise #1](#exercise-1)
- [Filtering and Sorting DataFrames](#filtering-sorting)
    - [Boolean Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise #2](#exercise-2)
- [Renaming, Adding, and Removing Columns](#columns)
    - [Renaming Columns](#renaming-columns)
    - [Adding Columns](#adding-columns)
    - [Removing Columns](#removing-columns)
- [Handling Missing Values](#missing)
    - [Finding Missing Values](#find-missing)
    - [Dropping Missing Values](#drop-missing)
    - [Filling in Missing Values](#fill-missing)


<a id='basics'></a>

## The Basics of `pandas` DataFrames

---

In [24]:
import pandas as pd
import numpy as np

<a id='loading'></a>
### Loading Data

**Q.1** Read in the data file.

```Python
users = pd.read_csv('../../../../resource-datasets/users/users.txt')
```

In [69]:
# ! ls /Users/Paxton615/GA/resource-datasets/

In [106]:
users = pd.read_csv('/Users/Paxton615/GA/resource-datasets/users/users.txt', sep='|')

**Q.2** Use kwargs to set appropriate data-reading parameters.
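
As a small illustration of the keyword arguments involved, the sketch below parses an in-memory string rather than the real file. The names `sample_text` and `sample_users` are hypothetical, and the `dtype` argument is an optional extra that the lesson itself does not use.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same pipe-delimited layout as users.txt.
sample_text = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# sep='|' splits fields on pipes instead of the default comma.
# dtype={'zip_code': str} (optional) keeps zip codes as text.
sample_users = pd.read_csv(io.StringIO(sample_text), sep='|', dtype={'zip_code': str})

print(sample_users.shape)
print(sample_users.dtypes)
```

Without `sep='|'`, each row would be parsed as a single column, which is usually the first sign that the delimiter is wrong.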

In [12]:
users.shape

(943, 5)

In [13]:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


<a id='examine'></a>
### A Basic Examination of DataFrames

**Q.1** Print the type of `users`.

In [14]:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first five rows, first 10 rows, and last two rows of `users`.

In [17]:
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [18]:
users.tail(2)

Unnamed: 0,user_id,age,gender,occupation,zip_code
941,942,48,F,librarian,78209
942,943,22,M,student,77841


**Q.3** Print the index and columns.

In [19]:
print(users.index)

RangeIndex(start=0, stop=943, step=1)


In [20]:
print(users.columns)

Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')


**Q.4** Find the dtypes of the columns.

In [21]:
users.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the DataFrame.

In [22]:
users.shape

(943, 5)

**Q.6** Extract the underlying `numpy` array as a new variable.

In [26]:
print(users.values)
type(users.values)

[[1 24 'M' 'technician' '85711']
 [2 53 'F' 'other' '94043']
 [3 23 'M' 'writer' '32067']
 ...
 [941 20 'M' 'student' '97229']
 [942 48 'F' 'librarian' '78209']
 [943 22 'M' 'student' '77841']]


numpy.ndarray

<a id='selecting'></a>
### Selecting Columns

**Q.1** Assign the `gender` column to a variable.

In [27]:
sex = users.gender

In [28]:
sex.head()

0    M
1    F
2    M
3    M
4    F
Name: gender, dtype: object

_Bracket notation (`users['gender']`) is generally preferred over attribute access (`users.gender`), because column names that contain spaces or periods, or that collide with existing DataFrame attributes, cannot be reached with dot syntax._
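
To make the difference between the two access styles concrete, here is a minimal sketch on a hypothetical DataFrame (`df` and its column names are made up for illustration and are not part of the lesson data).

```python
import pandas as pd

# Hypothetical frame: one column name contains a space,
# the other collides with the DataFrame.count method.
df = pd.DataFrame({'zip code': ['85711', '94043'], 'count': [3, 5]})

# Bracket notation works for any column name.
zips = df['zip code']
counts = df['count']

# Attribute access finds the method, not the column.
print(callable(df.count))
print(type(counts))
```

A name such as `zip code` cannot be written with dot syntax at all, so bracket notation is the only option there.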

**Q.2** What is the type of `gender`?

In [29]:
print(type(sex))
print(type(users.gender))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


**Q.3** Select `gender` and `occupation` as a new DataFrame.

In [30]:
new_df = users[['gender','occupation']]
new_df.head()

Unnamed: 0,gender,occupation
0,M,technician
1,F,other
2,M,writer
3,M,technician
4,F,other


In [40]:
dict(new_df)['gender'][:5]  # dict() converts the DataFrame to a dict: column names become keys and each value is that column's Series

0    M
1    F
2    M
3    M
4    F
Name: gender, dtype: object

In [47]:
users.set_index('user_id', inplace=True)

<a id='describing'></a>
### Describing Data

**Q.1** Calculate the descriptive statistics for the numeric columns in the DataFrame (_which is the function default_).  

In [48]:
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

In [49]:
users.describe()

Unnamed: 0,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


**Q.2** Describe the "object" (string) columns.

In [50]:
users.describe(include='object')

Unnamed: 0,gender,occupation,zip_code
count,943,943,943
unique,2,21,795
top,M,student,55414
freq,670,196,9


**Q.3** Describe all of the columns, regardless of type.

In [52]:
users.describe(include='all')  # here equivalent to users.describe(include=['int64', 'object'])

Unnamed: 0,age,gender,occupation,zip_code
count,943.0,943,943,943.0
unique,,2,21,795.0
top,,M,student,55414.0
freq,,670,196,9.0
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,


**Q.4** Describe the `gender` Series from the `users` DataFrame.

In [54]:
print(users.gender.value_counts())
print(users.gender.value_counts(normalize=True))

M    670
F    273
Name: gender, dtype: int64
M    0.710498
F    0.289502
Name: gender, dtype: float64


**Q.5** Calculate the mean of the `age` column.

In [55]:
print(np.mean(users.age))
print(np.std(users.age))
print(np.std(users.age, ddof=1))

34.05196182396607
12.18627315093721
12.192739733059032
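
The two standard deviations above differ only in `ddof`. The sketch below, using a hypothetical five-value Series named `ages`, shows that NumPy defaults to `ddof=0` (population) while the pandas `Series.std` method defaults to `ddof=1` (sample).

```python
import numpy as np
import pandas as pd

# Hypothetical small sample of ages.
ages = pd.Series([24, 53, 23, 24, 33])

pop_std = np.std(ages.to_numpy())              # NumPy default: ddof=0
sample_std = np.std(ages.to_numpy(), ddof=1)   # sample standard deviation
pandas_std = ages.std()                        # pandas default: ddof=1

print(pop_std, sample_std, pandas_std)
```

With only a handful of values the gap between the two is noticeable; with 943 rows, as in `users`, it is small.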


In [59]:
from scipy import stats
from scipy.stats import norm

In [60]:
def confidence_interval(data, con_interval):
    """Print a normal-approximation confidence interval for the mean of `data`."""
    mean = np.mean(data)
    std = np.std(data, ddof=1)  # ddof=1 gives the sample std; with small samples the choice of ddof matters a lot
    length = len(data)
    # Margin of error: z * s / sqrt(n)
    margin = stats.norm.ppf((1 + con_interval) / 2) * std / length ** 0.5
    results = [mean - margin, mean + margin]
    print('mean: \n', mean)
    print('std:\n', std)
    print('z*std/sqrtn:\n', margin)
    print('results:\n', results)

In [62]:
confidence_interval(users.age, 0.95)

mean: 
 34.05196182396607
std:
 12.192739733059032
z*std/sqrtn:
 0.7782041600355599
results:
 [33.27375766393051, 34.830165984001624]
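
The interval computed by `confidence_interval` follows the usual normal-approximation formula. As a sketch, with symbols chosen here for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation (`ddof=1`), $n$ the number of observations, and $c$ the confidence level.

$$
\bar{x} \;\pm\; z_{(1+c)/2}\,\frac{s}{\sqrt{n}}
$$

For $c = 0.95$ the quantile is $z_{0.975} \approx 1.96$, so the margin is roughly $1.96 \times 12.19 / \sqrt{943} \approx 0.78$, which agrees with the `z*std/sqrtn` value printed for `users.age`.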

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [63]:
users.describe(include='all')

Unnamed: 0,age,gender,occupation,zip_code
count,943.0,943,943,943.0
unique,,2,21,795.0
top,,M,student,55414.0
freq,,670,196,9.0
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,


In [65]:
users['age'].unique()

array([24, 53, 23, 33, 42, 57, 36, 29, 39, 28, 47, 45, 49, 21, 30, 35, 40,
       26, 25, 32, 41,  7, 38, 20, 19, 27, 18, 22, 37, 16, 50, 31, 51, 17,
       48, 34, 43, 60, 55, 15, 61, 44, 54, 59, 46, 13, 52, 56, 14, 66, 62,
       11, 65, 68, 63, 64, 10, 73, 58, 69, 70])

In [67]:
users['age'].value_counts()[:5]

30    39
25    38
22    37
28    36
27    35
Name: age, dtype: int64

<a id='exercise-1'></a>
## Exercise #1

---

Load the `drinks.csv` data provided below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the `drinks` DataFrame.
8. Find the first three items of the value counts of the `occupation` column (from the `users` DataFrame).

**BONUS:**
1. Create the `users` DataFrame from the `user_file` provided (which lacks a header row).
2. Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`.


In [76]:
! ls /Users/Paxton615/GA/resource-datasets/alcohol_by_country/

drinks.csv


In [78]:
drinks = pd.read_csv('/Users/Paxton615/GA/resource-datasets/alcohol_by_country/drinks.csv')
drinks.head(5)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [81]:
print(drinks.index)
print(drinks.columns)
print(drinks.dtypes)
print(drinks.shape)

RangeIndex(start=0, stop=193, step=1)
Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
(193, 6)


In [83]:
beer_servings = drinks.beer_servings
beer_servings.describe()

count    193.000000
mean     106.160622
std      101.143103
min        0.000000
25%       20.000000
50%       76.000000
75%      188.000000
max      376.000000
Name: beer_servings, dtype: float64

In [84]:
print(np.median(beer_servings))

76.0


In [85]:
drinks['continent'].value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64

In [89]:
! ls /Users/Paxton615/GA/resource-datasets/users/

users.txt          users_original.txt


In [90]:
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
# Assumption: the headerless copy is users_original.txt from the listing above.
user_file = '/Users/Paxton615/GA/resource-datasets/users/users_original.txt'
users = pd.read_csv(user_file, sep='|', header=None, names=user_cols, index_col='user_id')

In [91]:
users.head(2)

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043


<a id='filtering-sorting'></a>

## Filtering and Sorting DataFrames

---


<a id='filtering'></a>
### Boolean Filtering

**Q.1** Show users `age < 20` using a Boolean mask.

In [107]:
young_age = users.age<20
users[young_age]

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
35,36,19,F,student,93117
51,52,18,F,student,55105
56,57,16,M,none,84010
66,67,17,M,student,60402
...,...,...,...,...,...
871,872,19,F,student,74078
879,880,13,M,student,83702
886,887,14,F,student,27249
903,904,17,F,student,61073


In [110]:
young_student= (users.age<20) & (users.occupation=='student')
users[young_student]

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
35,36,19,F,student,93117
51,52,18,F,student,55105
66,67,17,M,student,60402
67,68,19,M,student,22904
...,...,...,...,...,...
862,863,17,M,student,60089
871,872,19,F,student,74078
879,880,13,M,student,83702
886,887,14,F,student,27249


**Q.2** Calculate the value counts of `occupation` for users `age < 20`.

In [112]:
users[young_age].occupation.value_counts()

student          64
other             4
none              3
writer            2
entertainment     2
salesman          1
artist            1
Name: occupation, dtype: int64

In [138]:
users['young_old']=''
users.loc[users['age']<20, 'young_old']='young'
users.loc[users['age']>=20, 'young_old']='old'

In [139]:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,young_old
0,1,24,M,technician,85711,old
1,2,53,F,other,94043,old
2,3,23,M,writer,32067,old
3,4,24,M,technician,43537,old
4,5,33,F,other,15213,old


In [140]:
users.young_old.value_counts()

old      866
young     77
Name: young_old, dtype: int64

**Q.3** Print the male users `age < 20`. 

**Q.4** Print the users `age < 10` or `age > 70`.
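
Combining conditions uses `&` (and) and `|` (or), with each condition wrapped in its own parentheses. The sketch below shows the pattern on a hypothetical five-row frame named `toy`, not on the real `users` data.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'age': [7, 19, 24, 73, 18],
    'gender': ['M', 'F', 'M', 'M', 'M'],
})

# & is element-wise AND: both conditions must hold.
young_males = toy[(toy['age'] < 20) & (toy['gender'] == 'M')]

# | is element-wise OR: either condition may hold.
extremes = toy[(toy['age'] < 10) | (toy['age'] > 70)]

print(young_males)
print(extremes)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.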

<a id='sorting'></a>
### Sorting

**Q.1** Return the `age` column sorted in ascending order.

**Q.2** Sort the `users` DataFrame by the `age` column (ascending).

In [29]:
# A:

**Q.3** Sort the `users` DataFrame by the `age` column in *descending* order.

In [30]:
# A:
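
One possible way to approach the sorting questions is `sort_values`, which exists on both Series and DataFrames. The sketch below uses a hypothetical three-row frame named `toy` rather than `users`.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'age': [33, 7, 53],
    'occupation': ['other', 'student', 'lawyer'],
})

ages_sorted = toy['age'].sort_values()                  # a sorted Series
by_age = toy.sort_values('age')                         # whole frame, ascending
by_age_desc = toy.sort_values('age', ascending=False)   # whole frame, descending

print(by_age_desc)
```

Each call returns a new object and leaves `toy` in its original order.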

<a id='exercise-2'></a>

## Exercise #2

---

**Using the `drinks` DataFrame from the previous exercise:**
1. Filter `drinks` to include only European countries.
2. Filter `drinks` to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.

**Using the `users` DataFrame:**
1. Sort `users` by occupation and then by `age` in a single command.
2. Filter `users` to only include doctors and lawyers without using a `|`.

> **Hint:** Look up `pandas.Series.isin`.

In [31]:
# A:
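
As a hedged sketch of the two `users` tasks, the example below applies `Series.isin` and a multi-column sort to a hypothetical four-row frame named `toy`. It illustrates the techniques only and is not a worked answer on the real data.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'occupation': ['lawyer', 'student', 'doctor', 'lawyer'],
    'age': [53, 22, 41, 30],
})

# isin tests membership in a list, avoiding a chain of | conditions.
professionals = toy[toy['occupation'].isin(['doctor', 'lawyer'])]

# Passing a list sorts by the first column, then breaks ties with the second.
ordered = toy.sort_values(['occupation', 'age'])

print(professionals)
print(ordered)
```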

<a id='columns'></a>

## Renaming, Adding, and Removing Columns

---

<a id='renaming-columns'></a>
### Renaming Columns

**Q.1** Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a *new* DataFrame.

In [32]:
# A:

**Q.2** Perform the same renaming for `drinks`, but in place.

In [33]:
# A:

In [34]:
# A:

**Q.3** Replace the column names of `drinks` with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [35]:
# A:
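
The renaming questions can be approached with `DataFrame.rename` or by assigning to `.columns`. The sketch below uses a hypothetical two-column frame named `toy` instead of `drinks`.

```python
import pandas as pd

# Hypothetical two-column stand-in for drinks.
toy = pd.DataFrame({'beer_servings': [0, 89], 'wine_servings': [0, 54]})

# rename returns a new DataFrame; toy keeps its original column names.
renamed = toy.rename(columns={'beer_servings': 'beer', 'wine_servings': 'wine'})

# Assigning to .columns replaces every name at once (length must match).
toy_copy = toy.copy()
toy_copy.columns = ['beer', 'wine']

print(renamed.columns)
print(toy.columns)
```

Passing `inplace=True` to `rename` would modify `toy` directly instead of returning a new object.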

<a id='adding-columns'></a>
### Adding Columns

**Q.1** Make a `servings` column that combines (sums) `beer`, `spirit`, and `wine`.

In [36]:
# A:

**Q.2** Make an `mL` column that is the `liters` column multiplied by 1,000.

In [37]:
# A:
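
New columns are created by assigning to a name that does not exist yet. The sketch below shows this on a hypothetical two-row frame named `toy` whose column names assume the renaming from the previous section.

```python
import pandas as pd

# Hypothetical stand-in for drinks after renaming.
toy = pd.DataFrame({
    'beer': [89, 25],
    'spirit': [132, 0],
    'wine': [54, 14],
    'liters': [4.9, 0.7],
})

# Column arithmetic is element-wise, row by row.
toy['servings'] = toy['beer'] + toy['spirit'] + toy['wine']
toy['mL'] = toy['liters'] * 1000

print(toy)
```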

<a id='removing-columns'></a>
### Removing Columns

**Q.1** Remove the `mL` column, returning a new DataFrame.

In [38]:
# A:

**Q.2** Remove the `mL` and `servings` columns from `drinks` in place.

In [39]:
# A:
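
Columns can be removed with `DataFrame.drop`. The sketch below contrasts the returning form with the in-place form on a hypothetical one-row frame named `toy`.

```python
import pandas as pd

# Hypothetical stand-in for drinks with the two added columns.
toy = pd.DataFrame({'beer': [89], 'servings': [275], 'mL': [4900.0]})

# Without inplace, drop returns a new DataFrame and leaves toy unchanged.
without_ml = toy.drop(columns='mL')

# With inplace=True, toy itself is modified and nothing is returned.
toy.drop(columns=['mL', 'servings'], inplace=True)

print(without_ml.columns)
print(toy.columns)
```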

<a id='missing'></a>
## Handling Missing Values

---

<a id='find-missing'></a>
### Finding Missing Values

**Q.1** Include missing values from the `continent` variable in the `drinks` DataFrame when counting unique values.

In [40]:
# A:

**Q.2** Create a Boolean Series indicating which values are missing or not missing in `continent`.

In [41]:
# A:

**Q.3** Subset to rows in `drinks` where `continent` is missing and where `continent` is not missing.

In [42]:
# A:

**Q.4** Calculate the sum of `drinks`' *columns* and the sum of its *rows*.

In [43]:
# A:

In [44]:
# A:

**Side Note: Adding Booleans**
```python
pd.Series([True, False, True])  # Creates a Boolean Series
pd.Series([True, False, True]).sum()  # Converts `False` to 0 and `True` to 1
```
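
Because `True` counts as 1, summing a Boolean mask counts matches. The sketch below applies that idea to missing values, using a hypothetical three-row frame named `toy` with made-up numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are invented for illustration.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'continent': ['EU', np.nan, np.nan],
    'beer': [10, 20, 30],
})

missing_mask = toy['continent'].isnull()        # Boolean Series
missing_per_column = toy.isnull().sum()         # count of NaN in each column
counts_with_nan = toy['continent'].value_counts(dropna=False)  # NaN included

print(missing_per_column)
print(counts_with_nan)
```

By default `value_counts` skips `NaN`, so `dropna=False` is what makes the missing group visible.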

**Q.5** Find the number of missing values by column in `drinks`.

In [45]:
# A:

<a id='drop-missing'></a>
### Dropping Missing Values

**Q.1** Drop rows where *ANY* values are missing in `drinks` (returning a new DataFrame).  
_Make sure you know ahead of time exactly what you'll be dropping._

In [46]:
# A:

**Q.2** Drop rows only where *ALL* values are missing in `drinks`.

In [47]:
# A:
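
The difference between dropping on *ANY* and *ALL* missing values comes down to the `how` argument of `dropna`. The sketch below shows both on a hypothetical three-row frame named `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: row 0 is complete, row 1 is partly missing,
# row 2 is entirely missing.
toy = pd.DataFrame({'a': [1.0, np.nan, np.nan], 'b': ['x', 'y', None]})

any_dropped = toy.dropna(how='any')   # drops rows 1 and 2
all_dropped = toy.dropna(how='all')   # drops only row 2

print(any_dropped)
print(all_dropped)
```

Both calls return new DataFrames, so `toy` still has all three rows afterwards.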

<a id='fill-missing'></a>
### Filling in Missing Values

What's up with these `NaN` continents?

In [48]:
# A:

_You probably figured it out already, but all of these countries are in North America (`NA`), and, when the file was read in, the string `NA` was misinterpreted as a `null` or `NaN` value._
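
The `NA` problem can be reproduced without the real file. The sketch below reads a hypothetical two-row CSV string (`csv_text`) and shows two possible remedies: filling the values afterwards, or switching off missing-value detection with `na_filter=False`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; 'NA' here means North America.
csv_text = "country,continent\nAlbania,EU\nCanada,NA\n"

# By default, read_csv treats the string 'NA' as a missing value.
default_read = pd.read_csv(io.StringIO(csv_text))

# Remedy 1: fill the missing entries back in after loading.
filled = default_read['continent'].fillna('NA')

# Remedy 2: disable missing-value detection while loading.
raw_read = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default_read)
print(raw_read)
```

Note that `na_filter=False` disables detection for every column, so it is only appropriate when the file has no genuinely missing entries.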

**Q.1** Fill in the missing values of the `continent` column using string `NA`.

In [49]:
# A:

**Q.2** Turn off the missing value filter when loading the `drinks` `.csv`.

In [50]:
# A: