#### Copyright 2019 Google LLC.

In [1]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to Pandas

Now that you've had a quick introduction to Python, let's talk about Pandas.

[*Pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.

This introduction is limited to the DataFrame and Series objects and the methods on those objects that support data analysis.

## Overview

### Learning Objectives

  * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library
  * Access and manipulate data within a `DataFrame` and `Series`
  * Import CSV data into a *pandas* `DataFrame`
  * Reindex a `DataFrame` to shuffle data

### Prerequisites

* Introduction to Python

### Estimated Duration

90 minutes

## Basic Concepts

The following line imports the *pandas* API and prints the API version:

In [2]:
import pandas as pd
pd.__version__

'0.24.2'

The primary data structures in *pandas* are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html).

One way to create a `Series` is to construct a `Series` object. For example:

In [3]:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

0    San Francisco
1         San Jose
2       Sacramento
dtype: object

`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:

In [4]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

pd.DataFrame({ 'City name': city_names, 'Population': population })

Unnamed: 0,City name,Population
0,San Francisco,852469
1,San Jose,1015785
2,Sacramento,485199
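
To illustrate the note above about `Series` of different lengths, here is a minimal sketch in which `population` is deliberately one value short. The name `cities_partial` is made up for this illustration.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
# Deliberately one value short, so index label 2 has no population.
population = pd.Series([852469, 1015785])

cities_partial = pd.DataFrame({'City name': city_names, 'Population': population})
print(cities_partial)

# The missing entry is filled with NaN, which also turns the
# integer column into a floating-point column.
print(cities_partial['Population'].dtype)
```

The rows are matched up by index label, not by position, which is why the gap shows up in the last row.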


But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics:

In [5]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


The example above used `DataFrame.describe` to show summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column of a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:

In [6]:
california_housing_dataframe.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:

In [7]:
california_housing_dataframe.hist('housing_median_age')

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10d525748>]],
      dtype=object)

## Accessing Data

You can access `DataFrame` data using familiar Python dict/list operations:

In [8]:
cities = pd.DataFrame({ 'City name': city_names, 'Population': population })
print(type(cities['City name']))
cities['City name']

<class 'pandas.core.series.Series'>


0    San Francisco
1         San Jose
2       Sacramento
Name: City name, dtype: object

In [9]:
print(type(cities['City name'][1]))
cities['City name'][1]

<class 'str'>


'San Jose'

In [10]:
print(type(cities[0:2]))
cities[0:2]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,City name,Population
0,San Francisco,852469
1,San Jose,1015785


In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here.
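
As a small taste of that API, the sketch below shows three common selection styles: label-based `loc`, position-based `iloc`, and boolean masks. It rebuilds a reduced `cities` table so that it runs on its own. The variable names and the 500,000 threshold are chosen only for illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Label-based: the row whose index label is 1, column 'City name'.
name = cities.loc[1, 'City name']

# Position-based: the first two rows of the first column.
first_two = cities.iloc[0:2, 0]

# Boolean mask: keep only rows where the condition is True.
large = cities[cities['Population'] > 500000]

print(name)
print(first_two.tolist())
print(large['City name'].tolist())
```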

## Manipulating Data

You may apply Python's basic arithmetic operations to `Series`. For example:

In [11]:
population / 1000.

0     852.469
1    1015.785
2     485.199
dtype: float64

[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:

In [12]:
import numpy as np

np.log(population)

0    13.655892
1    13.831172
2    13.092314
dtype: float64

For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), `Series.apply` accepts a function as an argument, often a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), and applies it to each value.

The example below creates a new `Series` that indicates whether `population` is over one million:

In [13]:
population.apply(lambda val: val > 1000000)

0    False
1     True
2    False
dtype: bool
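
For a comparison this simple, the operator can also be applied to the `Series` directly, in the same way as the arithmetic shown earlier. The sketch below checks that both approaches give the same values. The names `via_apply` and `vectorized` are made up for this illustration.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Element-by-element, using a lambda.
via_apply = population.apply(lambda val: val > 1000000)

# The same comparison applied to the whole Series at once.
vectorized = population > 1000000

print(via_apply.tolist() == vectorized.tolist())
```

`Series.apply` is most useful when the transformation cannot be written with ordinary operators or NumPy functions.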


Modifying `DataFrames` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:

In [14]:
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities

Unnamed: 0,City name,Population,Area square miles,Population density
0,San Francisco,852469,46.87,18187.945381
1,San Jose,1015785,176.53,5754.17776
2,Sacramento,485199,97.92,4955.055147


## Challenge

Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:

  * The city is named after a saint.
  * The city has an area greater than 50 square miles.

**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.

**Hint:** "San" in Spanish means "saint."

In [15]:
# Your code here
cities['City name and area check'] = cities['City name'].apply(lambda name: name.startswith('San')) & (cities['Area square miles'] > 50)
cities

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check
0,San Francisco,852469,46.87,18187.945381,False
1,San Jose,1015785,176.53,5754.17776,True
2,Sacramento,485199,97.92,4955.055147,False


### Solution

Click below for a solution.

In [16]:
cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))
cities

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check,Is wide and has saint name
0,San Francisco,852469,46.87,18187.945381,False,False
1,San Jose,1015785,176.53,5754.17776,True,True
2,Sacramento,485199,97.92,4955.055147,False,False
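
An alternative sketch, not part of the original solution, uses the `Series.str` accessor instead of `apply`. It rebuilds only the two columns the check needs. The intermediate names `is_saint` and `is_wide` are made up for this illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Area square miles': [46.87, 176.53, 97.92],
})

# Vectorized string test on every city name.
is_saint = cities['City name'].str.startswith('San')
is_wide = cities['Area square miles'] > 50

# Combine the two boolean Series with the bitwise & operator.
cities['Is wide and has saint name'] = is_saint & is_wide
print(cities)
```

Naming the two conditions first also avoids the parentheses that are otherwise needed because `&` binds more tightly than `>`.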


## Indexes
Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. 

By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.

In [17]:
city_names.index

RangeIndex(start=0, stop=3, step=1)

In [18]:
cities.index

RangeIndex(start=0, stop=3, step=1)
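
To see what "stable" means here, the following sketch sorts a `Series` and then inspects its index. Each label travels with its value, so the labels are no longer in order. The name `sorted_population` is made up for this illustration.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Sort by value, ascending; the index labels stay attached to their values.
sorted_population = population.sort_values()

print(sorted_population)
print(sorted_population.index.tolist())
```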

Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:

In [19]:
cities.reindex([2, 0, 1])

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check,Is wide and has saint name
2,Sacramento,485199,97.92,4955.055147,False,False
0,San Francisco,852469,46.87,18187.945381,False,False
1,San Jose,1015785,176.53,5754.17776,True,True


Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.
Try running the following cell multiple times!

In [20]:
cities.reindex(np.random.permutation(cities.index))

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check,Is wide and has saint name
1,San Jose,1015785,176.53,5754.17776,True,True
0,San Francisco,852469,46.87,18187.945381,False,False
2,Sacramento,485199,97.92,4955.055147,False,False
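
If you need the same shuffle on every run, one option is a seeded random generator. The sketch below assumes an arbitrary seed of 42 and rebuilds a reduced `cities` table so that it runs on its own. The names `rng`, `shuffled_labels`, and `shuffled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# A seeded generator makes the permutation repeatable (the seed is arbitrary).
rng = np.random.RandomState(42)

# permutation returns a new shuffled array; cities.index is not modified.
shuffled_labels = rng.permutation(cities.index.values)
shuffled = cities.reindex(shuffled_labels)

print(cities.index.tolist())    # still in the original order
print(shuffled.index.tolist())  # a permutation of the same labels
```

`DataFrame.sample(frac=1)` is another common way to shuffle rows.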


For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects).

## Challenge

The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed? Does reindexing affect the underlying data structure or does it make a copy?

In [21]:
# Your code here
cities.reindex([2, 0, 1, 4, 3])

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check,Is wide and has saint name
2,Sacramento,485199.0,97.92,4955.055147,False,False
0,San Francisco,852469.0,46.87,18187.945381,False,False
1,San Jose,1015785.0,176.53,5754.17776,True,True
4,,,,,,
3,,,,,,


### Solution

Click below for the solution.

If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these "missing" indices and populate all corresponding columns with `NaN` values:

In [22]:
cities.reindex([0, 4, 5, 2])

Unnamed: 0,City name,Population,Area square miles,Population density,City name and area check,Is wide and has saint name
0,San Francisco,852469.0,46.87,18187.945381,False,False
4,,,,,,
5,,,,,,
2,Sacramento,485199.0,97.92,4955.055147,False,False


This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex
documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example
in which the index values are browser names).

In this case, allowing "missing" indices makes it easy to reindex using an external list, as you don't have to worry about
sanitizing the input.
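
On the last question in the challenge, here is a minimal sketch suggesting that `reindex` returns a new `DataFrame` and leaves the original alone. It uses a reduced `cities` table so that it runs on its own. The name `reindexed` and the `-1` value are made up for this illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Label 4 does not exist in cities, so its row is filled with NaN.
reindexed = cities.reindex([0, 4, 2])

# Overwrite a value in the reindexed result only.
reindexed.loc[0, 'Population'] = -1

print(cities.index.tolist())        # the original index is unchanged
print(cities.loc[0, 'Population'])  # the original value is unchanged
print(reindexed)
```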

# Exercises

## Exercise 1

1.   Visit [data.gov's CSV datasets](https://catalog.data.gov/dataset?res_format=CSV) and find "Demographic Statistics By Zip Code: City of New York — Demographic statistics broken down by zip code".
2.   Find the CSV download URL for the dataset and use that URL to download the CSV into a DataFrame.
3.   Use *pandas* to "describe" the data.

### Student Solution

In [23]:
# Your code goes here
newyork_demographics = pd.read_csv("https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv?accessType=DOWNLOAD", sep=",")
newyork_demographics.describe()

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
count,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,...,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0
mean,11127.173729,17.661017,10.29661,0.243983,7.364407,0.201017,0.0,0.0,17.661017,44.491525,...,17.661017,44.487288,5.974576,0.139195,11.686441,0.305805,0.0,0.0,17.661017,44.491525
std,1052.43179,43.279737,28.189121,0.333677,18.880888,0.295731,0.0,0.0,43.279737,49.801264,...,43.279737,49.796563,16.803729,0.227979,29.369458,0.380602,0.0,0.0,43.279737,49.801264
min,10001.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10451.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,11216.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,11422.25,13.0,6.0,0.515,4.0,0.4,0.0,0.0,13.0,100.0,...,13.0,100.0,3.25,0.2525,8.0,0.6625,0.0,0.0,13.0,100.0
max,20459.0,272.0,194.0,1.0,157.0,1.0,0.0,0.0,272.0,100.0,...,272.0,100.0,155.0,1.0,206.0,1.0,0.0,0.0,272.0,100.0


## Exercise 2

Use the *pandas* `head` method to look at the first 20 rows of data.

### Student Solution

In [24]:
# Your code goes here
newyork_demographics.head(20)

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
0,10001,44,22,0.5,22,0.5,0,0,44,100,...,44,100,20,0.45,24,0.55,0,0,44,100
1,10002,35,19,0.54,16,0.46,0,0,35,100,...,35,100,2,0.06,33,0.94,0,0,35,100
2,10003,1,1,1.0,0,0.0,0,0,1,100,...,1,100,0,0.0,1,1.0,0,0,1,100
3,10004,0,0,0.0,0,0.0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
4,10005,2,2,1.0,0,0.0,0,0,2,100,...,2,100,0,0.0,2,1.0,0,0,2,100
5,10006,6,2,0.33,4,0.67,0,0,6,100,...,6,100,0,0.0,6,1.0,0,0,6,100
6,10007,1,0,0.0,1,1.0,0,0,1,100,...,1,100,1,1.0,0,0.0,0,0,1,100
7,10009,2,0,0.0,2,1.0,0,0,2,100,...,2,100,0,0.0,2,1.0,0,0,2,100
8,10010,0,0,0.0,0,0.0,0,0,0,0,...,0,0,0,0.0,0,0.0,0,0,0,0
9,10011,3,2,0.67,1,0.33,0,0,3,100,...,3,100,0,0.0,3,1.0,0,0,3,100


## Exercise 3

Use the *pandas* `tail` method to look at the last 20 rows of data.

### Student Solution

In [25]:
# Your code goes here
newyork_demographics.tail(20)

Unnamed: 0,JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
216,12701,87,49,0.56,38,0.44,0,0,87,100,...,87,100,18,0.21,69,0.79,0,0,87,100
217,12733,13,6,0.46,7,0.54,0,0,13,100,...,13,100,2,0.15,11,0.85,0,0,13,100
218,12734,252,170,0.67,82,0.33,0,0,252,100,...,252,100,61,0.24,191,0.76,0,0,252,100
219,12737,35,34,0.97,1,0.03,0,0,35,100,...,35,100,20,0.57,15,0.43,0,0,35,100
220,12750,37,36,0.97,1,0.03,0,0,37,100,...,37,100,21,0.57,16,0.43,0,0,37,100
221,12751,7,2,0.29,5,0.71,0,0,7,100,...,7,100,2,0.29,5,0.71,0,0,7,100
222,12754,134,64,0.48,70,0.52,0,0,134,100,...,134,100,27,0.2,107,0.8,0,0,134,100
223,12758,82,50,0.61,32,0.39,0,0,82,100,...,82,100,24,0.29,58,0.71,0,0,82,100
224,12759,11,0,0.0,11,1.0,0,0,11,100,...,11,100,8,0.73,3,0.27,0,0,11,100
225,12763,11,11,1.0,0,0.0,0,0,11,100,...,11,100,8,0.73,3,0.27,0,0,11,100


## Exercise 4

There are over 40 columns of data in this data frame. It is difficult to read them horizontally. Write some *pandas* code that prints the name of each column (*pandas* series). Print one name per line.

### Student Solution

In [26]:
# Your code goes here
for column in newyork_demographics.columns:
  print(column)

JURISDICTION NAME
COUNT PARTICIPANTS
COUNT FEMALE
PERCENT FEMALE
COUNT MALE
PERCENT MALE
COUNT GENDER UNKNOWN
PERCENT GENDER UNKNOWN
COUNT GENDER TOTAL
PERCENT GENDER TOTAL
COUNT PACIFIC ISLANDER
PERCENT PACIFIC ISLANDER
COUNT HISPANIC LATINO
PERCENT HISPANIC LATINO
COUNT AMERICAN INDIAN
PERCENT AMERICAN INDIAN
COUNT ASIAN NON HISPANIC
PERCENT ASIAN NON HISPANIC
COUNT WHITE NON HISPANIC
PERCENT WHITE NON HISPANIC
COUNT BLACK NON HISPANIC
PERCENT BLACK NON HISPANIC
COUNT OTHER ETHNICITY
PERCENT OTHER ETHNICITY
COUNT ETHNICITY UNKNOWN
PERCENT ETHNICITY UNKNOWN
COUNT ETHNICITY TOTAL
PERCENT ETHNICITY TOTAL
COUNT PERMANENT RESIDENT ALIEN
PERCENT PERMANENT RESIDENT ALIEN
COUNT US CITIZEN
PERCENT US CITIZEN
COUNT OTHER CITIZEN STATUS
PERCENT OTHER CITIZEN STATUS
COUNT CITIZEN STATUS UNKNOWN
PERCENT CITIZEN STATUS UNKNOWN
COUNT CITIZEN STATUS TOTAL
PERCENT CITIZEN STATUS TOTAL
COUNT RECEIVES PUBLIC ASSISTANCE
PERCENT RECEIVES PUBLIC ASSISTANCE
COUNT NRECEIVES PUBLIC ASSISTANCE
PERCENT NRECEIV

## Exercise 5

1.  Zip codes should be unique in this dataset. Verify that is true.
1.  Reindex the data frame using zip code as the index.
1.  Use *pandas* to write code to find all of the zip codes where there is no data (all zeros) in all columns except the zip code.
1.  Remove these rows of data from the DataFrame.

### Student Solution

In [27]:
# Your code goes here
newyork_demographics['JURISDICTION NAME'].is_unique


True

In [28]:
newyork_demographics.set_index('JURISDICTION NAME', inplace=True)

In [29]:
newyork_demographics.loc[(newyork_demographics!=0).any(axis=1)]


JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
10001,44,22,0.50,22,0.50,0,0,44,100,0,...,44,100,20,0.45,24,0.55,0,0,44,100
10002,35,19,0.54,16,0.46,0,0,35,100,0,...,35,100,2,0.06,33,0.94,0,0,35,100
10003,1,1,1.00,0,0.00,0,0,1,100,0,...,1,100,0,0.00,1,1.00,0,0,1,100
10005,2,2,1.00,0,0.00,0,0,2,100,0,...,2,100,0,0.00,2,1.00,0,0,2,100
10006,6,2,0.33,4,0.67,0,0,6,100,0,...,6,100,0,0.00,6,1.00,0,0,6,100
10007,1,0,0.00,1,1.00,0,0,1,100,0,...,1,100,1,1.00,0,0.00,0,0,1,100
10009,2,0,0.00,2,1.00,0,0,2,100,0,...,2,100,0,0.00,2,1.00,0,0,2,100
10011,3,2,0.67,1,0.33,0,0,3,100,0,...,3,100,0,0.00,3,1.00,0,0,3,100
10013,8,1,0.13,7,0.88,0,0,8,100,0,...,8,100,1,0.13,7,0.88,0,0,8,100
10016,17,12,0.71,5,0.29,0,0,17,100,0,...,17,100,9,0.53,8,0.47,0,0,17,100
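
The four steps of this exercise can be sketched on a tiny stand-in table. The `demo` DataFrame, its column names, and its values are hypothetical and are not taken from the real dataset. Note that selecting rows with `.loc[...]` returns a new `DataFrame`, so the result has to be assigned for the rows to stay removed.

```python
import pandas as pd

# Hypothetical stand-in for the demographics data.
demo = pd.DataFrame({
    'zip': [10001, 10004, 10005],
    'count_female': [22, 0, 2],
    'count_male': [22, 0, 0],
})

# 1. Check that zip codes are unique.
print(demo['zip'].is_unique)

# 2. Use the zip code as the index.
demo = demo.set_index('zip')

# 3. Find zip codes where every remaining column is zero.
all_zero = (demo == 0).all(axis=1)
empty_zips = demo[all_zero].index.tolist()
print(empty_zips)

# 4. Keep only the rows with some data, and assign the result.
demo = demo.loc[~all_zero]
print(demo)
```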


## Exercise 6

Examine the dataset and try to do some research to find out how the data was collected. Try to identify problems with the dataset related to data quality, data completeness, and bias.

### Student Solution

In [30]:
newyork_demographics.describe()

Unnamed: 0,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE,COUNT MALE,PERCENT MALE,COUNT GENDER UNKNOWN,PERCENT GENDER UNKNOWN,COUNT GENDER TOTAL,PERCENT GENDER TOTAL,COUNT PACIFIC ISLANDER,...,COUNT CITIZEN STATUS TOTAL,PERCENT CITIZEN STATUS TOTAL,COUNT RECEIVES PUBLIC ASSISTANCE,PERCENT RECEIVES PUBLIC ASSISTANCE,COUNT NRECEIVES PUBLIC ASSISTANCE,PERCENT NRECEIVES PUBLIC ASSISTANCE,COUNT PUBLIC ASSISTANCE UNKNOWN,PERCENT PUBLIC ASSISTANCE UNKNOWN,COUNT PUBLIC ASSISTANCE TOTAL,PERCENT PUBLIC ASSISTANCE TOTAL
count,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,...,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0,236.0
mean,17.661017,10.29661,0.243983,7.364407,0.201017,0.0,0.0,17.661017,44.491525,0.025424,...,17.661017,44.487288,5.974576,0.139195,11.686441,0.305805,0.0,0.0,17.661017,44.491525
std,43.279737,28.189121,0.333677,18.880888,0.295731,0.0,0.0,43.279737,49.801264,0.204705,...,43.279737,49.796563,16.803729,0.227979,29.369458,0.380602,0.0,0.0,43.279737,49.801264
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,13.0,6.0,0.515,4.0,0.4,0.0,0.0,13.0,100.0,0.0,...,13.0,100.0,3.25,0.2525,8.0,0.6625,0.0,0.0,13.0,100.0
max,272.0,194.0,1.0,157.0,1.0,0.0,0.0,272.0,100.0,2.0,...,272.0,100.0,155.0,1.0,206.0,1.0,0.0,0.0,272.0,100.0


*Your answer goes here (English, not Python!)*

One of the biggest problems with the dataset is sampling bias. The data is broken down by zip code, and the number of sampled participants can differ widely from one zip code to the next, so the dataset cannot provide an unbiased overview of New York City's demographics.
The gender split also varies: individual zip codes can have a large gap between the number of female and male participants, even where the citywide totals look more balanced.

For data quality, the size of the sample plays a big role. The sample is small relative to the actual population of New York City, so it might not accurately reflect the population's statistics. A larger dataset might give a different picture of citizenship, race, and gender.

For data completeness, the dataset could also include age, birthplace, and other features to support a more comprehensive analysis of the population, such as geographic fluidity.

