# Lab Assignment 9: Data Management Using `pandas`, Part 2
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

## Problem 0
Import the following libraries:

In [1]:
import numpy as np
import pandas as pd

## Problem 1
In the first part of this lab, the goal is to merge data from the United Nations World Health Organization (https://www.who.int/who-un/en/) with data from the Varieties of Democracy Project (https://www.v-dem.net/en/). The UN-WHO studies health outcomes in a cross-national context, and V-Dem studies the quality of democracy as it changes across countries and over time. We would want to merge these two datasets together if we wanted to study whether democratic quality can predict health outcomes.

The UN data contains cross-national time series data from the United Nations and World Health Organization, and includes three features:

* The number of physicians per 1000 people
* The percent of the population that is malnourished
* Health expenditure per capita

The VDem data comes from the Varieties of Democracy project, which aims to measure the quality of democracy and the amount of corruption in different countries over time (https://www.v-dem.net/en/data/data-version-8/). This data file contains indices regarding a country’s democractic quality, level of civil liberites, and corruption. It also contains a binary indicator that separates countries into democratic and nondemocratic states, and it includes a categorizaton of the corruption scale.

The URLs for the two datasets are:

In [2]:
undata_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/UNdata.csv"
VDem_url = "https://github.com/jkropko/DS-6001/raw/master/localdata/vdem.csv"

### Part a
Load both CSV files. Make sure to check whether there are rows that should not be included in the dataframe, and whether there are missing codes that should be replaced with `NaN`. Fix these problems at the data loading stage, if you can. (Don't worry about column names or category labels yet.) Also, the UN data covers the years 1960-2014, and the VDem data covers the years 1960-2015. To make the timeframe match up, delete rows in the VDem data from 2015. (1 point)

In [3]:
ud = pd.read_csv(undata_url).replace("..",np.nan).iloc[:-5,:]
ud.head(5)

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.0348442494869232,,,,,0.0634277984499931,...,0.136,0.146,0.145,0.175,0.194,0.234,0.225,0.266,,
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291221380234,,,,,0.48128342628479,...,1.15,1.146,,1.144,1.132,1.113,1.145,1.145,,
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148155212402,,,,,0.116413652896881,...,,1.207,,,1.207,,,,,
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,,,,,,,...,3.64,3.716,,3.912,4.0,,,,,


In [4]:
# ud.dtypes

In [5]:
vd = pd.read_csv(VDem_url).iloc[:,1:]
vd = vd[vd["year"]!=2015]
vd.head(5)

Unnamed: 0,country_name,country_id,country_text_id,year,historical_date,codingstart,gapstart,gapend,codingend,COWcode,...,v2xcs_ccsi_codehigh,v2xcs_ccsi_codelow,v2xps_party,v2xps_party_codehigh,v2xps_party_codelow,v2x_gender,v2x_gender_codehigh,v2x_gender_codelow,v2x_gencl,v2x_gencl_codehigh
0,Mexico,3,MEX,1960,1960-01-01,1900,,,2014,70.0,...,0.451123,0.170201,0.681416,0.811379,0.524055,0.347498,0.42127,0.273726,0.555367,0.714971
1,Mexico,3,MEX,1961,1961-01-01,1900,,,2014,70.0,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
2,Mexico,3,MEX,1962,1962-01-01,1900,,,2014,70.0,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
3,Mexico,3,MEX,1963,1963-01-01,1900,,,2014,70.0,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.344214,0.417813,0.270614,0.555367,0.714971
4,Mexico,3,MEX,1964,1964-01-01,1900,,,2014,70.0,...,0.461693,0.175715,0.681416,0.811379,0.524055,0.356873,0.428861,0.284885,0.555367,0.714971


In [6]:
# pd.set_option('display.max_rows', None)
# vd.dtypes

In [7]:
# pd.reset_option('display.max_rows', None)

### Part b
The UN data contain certain rows that refer to groups of countries instead of to individual countries. Here’s a list of these non-countries:

In [8]:
noncountries = ['Arab World',  'Caribbean small states',  'Central Europe and the Baltics', 
    'Early-demographic dividend',  'East Asia & Pacific', 'East Asia & Pacific (excluding high income)', 
    'East Asia & Pacific (IDA & IBRD countries)', 'Euro area', 'Europe & Central Asia', 
    'Europe & Central Asia (excluding high income)', 'Europe & Central Asia (IDA & IBRD countries)', 'European Union', 
    'Fragile and conflict affected situations', 'Heavily indebted poor countries (HIPC)', 
    'High income', 'Late-demographic dividend', 'Latin America & Caribbean', 
    'Latin America & Caribbean (excluding high income)', 
    'Latin America & the Caribbean (IDA & IBRD countries)', 'Least developed countries: UN classification', 
    'Low & middle income', 'Low income', 'Lower middle income', 
    'Middle East & North Africa', 'Middle East & North Africa (excluding high income)',
    'Middle East & North Africa (IDA & IBRD countries)', 
    'Middle income', 'North America', 'OECD members', 
    'Other small states', 'Pacific island small states', 'Post-demographic dividend', 
    'Pre-demographic dividend', 'Small states', 'South Asia', 
    'South Asia (IDA & IBRD)', 'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)', 
    'Sub-Saharan Africa (IDA & IBRD countries)', 'Upper middle income', 'World']

We can use `.query()` to remove the non-countries from the data, but in this case there are complications due to the space in the name of the column `Country Name` and the use of an external list. So here let's use an alternative method:

First, apply the `.isin(noncountries)` method to the `Country Name` column of the UN data to create a series of values that are `True` if the `Country Name` on a row is one of the non-countries, and `False` otherwise. Second, use the `~` operator to negate the logical values: turn `True` to `False` and vice versa. Finally, pass this logical series to the `.loc[]` attribute of the dataframe to drop the rows that refer to these noncountries from the UN data. (1 point)

(If you wanted to use `.query()`, you would first need to rename `Country Name` to remove the space, then you can use an `@` in front of `noncountries` to refer to the external list. But for this problem follow the instructions listed above.)

In [9]:
ud = ud.loc[~ud["Country Name"].isin(noncountries)]
ud.head(5)

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,0.0348442494869232,,,,,0.0634277984499931,...,0.136,0.146,0.145,0.175,0.194,0.234,0.225,0.266,,
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,0.276291221380234,,,,,0.48128342628479,...,1.15,1.146,,1.144,1.132,1.113,1.145,1.145,,
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,0.173148155212402,,,,,0.116413652896881,...,,1.207,,,1.207,,,,,
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,,,,,,,...,3.64,3.716,,3.912,4.0,,,,,


### Part c
Reshape the UN data to move the years from the columns to the rows. (Once the years are in the rows, they will have values such as "1960 [YR1960]".) (2 points)

In [10]:
ud = pd.melt(ud, 
              id_vars=['Series Name', 'Series Code', 'Country Name', 'Country Code'], 
              var_name='Year', 
              value_name='data')
ud.head(5)

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,Year,data
0,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Afghanistan,AFG,1960 [YR1960],0.0348442494869232
1,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Albania,ALB,1960 [YR1960],0.276291221380234
2,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Algeria,DZA,1960 [YR1960],0.173148155212402
3,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,American Samoa,ASM,1960 [YR1960],
4,"Physicians (per 1,000 people)",SH.MED.PHYS.ZS,Andorra,ADO,1960 [YR1960],


In [11]:
len(ud)

36456

### Part d
Rename the `variable` column to `year`. Then use string methods to remove the ends such as "[YR1960]" from the values of the new `year` column and convert the column to an integer data type.

Also, for whatever reason, real world data often contains multiple variables that are just different representations of the same information. In this case, the `Series Name` and `Series Code` variables tell us exactly the same thing, and the `Country Name` and `Country Code` variables tell us exactly the same thing. Unless I have a very good reason to keep both, I generally prefer to drop variables that are redundant and coded in a less helpful way. So drop `Series Code` and `Country Code`. (2 points)

In [12]:
ud = ud.rename(columns={"Year":"year"})

In [13]:
ud.year = [x[0] for x in ud.year.str.strip().str.split(" ")]
ud.year.astype(int)

0        1960
1        1960
2        1960
3        1960
4        1960
         ... 
36451    2015
36452    2015
36453    2015
36454    2015
36455    2015
Name: year, Length: 36456, dtype: int32

In [14]:
ud = ud.iloc[:,[0,2,4,5]]
ud.head(5)

Unnamed: 0,Series Name,Country Name,year,data
0,"Physicians (per 1,000 people)",Afghanistan,1960,0.0348442494869232
1,"Physicians (per 1,000 people)",Albania,1960,0.276291221380234
2,"Physicians (per 1,000 people)",Algeria,1960,0.173148155212402
3,"Physicians (per 1,000 people)",American Samoa,1960,
4,"Physicians (per 1,000 people)",Andorra,1960,


In [15]:
len(ud)

36456

### Part e
Reshape the data to move the values of `Series Name` to separate columns. Make sure all of the columns exist in the dataframe after reshaping and are not stored in a row index or multi-index. Then rename the columns so that all of the columns have concise and descriptive names. (2 points)

In [16]:
ud = pd.pivot_table(ud,
                     index = ["Country Name", "year"],
                     columns = ["Series Name"],
                     values = ["data"]).reset_index()
ud

Unnamed: 0_level_0,Country Name,year,data,data,data
Series Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Health expenditure per capita (current US$),"Physicians (per 1,000 people)",Prevalence of undernourishment (% of population)
0,Afghanistan,1960,,0.034844,
1,Afghanistan,1965,,0.063428,
2,Afghanistan,1970,,0.0649,
3,Afghanistan,1981,,0.077,
4,Afghanistan,1986,,0.1831,
...,...,...,...,...,...
6396,Zimbabwe,2011,48.46958,0.083,33.5
6397,Zimbabwe,2012,57.253763,,33.2
6398,Zimbabwe,2013,62.309228,,33.5
6399,Zimbabwe,2014,57.710452,,34.0


In [17]:
ud.columns = ["country_name", "year", "health_exp", "pys_p1000", "perc_malnourished"]
ud.columns

Index(['country_name', 'year', 'health_exp', 'pys_p1000', 'perc_malnourished'], dtype='object')

In [18]:
ud.head()

Unnamed: 0,country_name,year,health_exp,pys_p1000,perc_malnourished
0,Afghanistan,1960,,0.034844,
1,Afghanistan,1965,,0.063428,
2,Afghanistan,1970,,0.0649,
3,Afghanistan,1981,,0.077,
4,Afghanistan,1986,,0.1831,


In [19]:
len(ud)

6401

### Part f
Next we are going to join the cleaned UN data with the VDem data. In a perfect world, both datasets would include a shared numeric country ID field that we can use to match countries in one dataset to countries in the other. Unfortunately the UN data identifies the countries only by name. Worse still, while there is a big overlap the two datasets cover different sets of countries.

First decide whether this merge is a one-to-one, one-to-many, many-to-one, or many-to-many merge and describe your rationale in words.

Then perform a test merge that checks whether your expectation that the merge is one-to-one, one-to-many, many-to-one, or many-to-many is confirmed, and reports whether each row is matched, appears only in the UN data, or appears only in the VDem data. Use the `.unique()` or `.value_counts()` method to display the names of the countries that are not matched. (2 points)

#### Deciding type of merge
assumption: left-ud, right vd


I would do many-to-many because the ud data might have countries that the vd data doesn't and vice versa.

In [20]:
testdf = pd.merge(ud, vd,
                  on="country_name",
                  how="outer",
                  validate="many_to_many",
                  indicator="matched")

In [21]:
not testdf.duplicated().any()

True

In [22]:
testdf["matched"].value_counts()

matched
both          250774
left_only       1462
right_only      1220
Name: count, dtype: int64

In [23]:
## country names present in ud data that's not in vd
testdf.query("matched=='left_only'")["country_name"].unique()

array(['American Samoa', 'Andorra', 'Antigua and Barbuda', 'Aruba',
       'Bahamas, The', 'Bahrain', 'Belize', 'Bermuda',
       'Brunei Darussalam', 'Cabo Verde', 'Cayman Islands',
       'Channel Islands', 'Congo, Dem. Rep.', 'Congo, Rep.',
       "Cote d'Ivoire", 'Dominica', 'Egypt, Arab Rep.',
       'Equatorial Guinea', 'French Polynesia', 'Gambia, The',
       'Greenland', 'Grenada', 'Guam', 'Hong Kong SAR, China',
       'Iran, Islamic Rep.', 'Kiribati', 'Korea, Dem. People’s Rep.',
       'Korea, Rep.', 'Kuwait', 'Kyrgyz Republic', 'Lao PDR',
       'Luxembourg', 'Macao SAR, China', 'Macedonia, FYR', 'Malta',
       'Marshall Islands', 'Micronesia, Fed. Sts.', 'Monaco', 'Myanmar',
       'Nauru', 'New Caledonia', 'Northern Mariana Islands', 'Oman',
       'Palau', 'Puerto Rico', 'Russian Federation', 'Samoa',
       'San Marino', 'Singapore', 'Slovak Republic',
       'St. Kitts and Nevis', 'St. Lucia',
       'St. Vincent and the Grenadines', 'Syrian Arab Republic',
       'T

In [24]:
## country names present in vd data that's not in ud data
testdf.query("matched=='right_only'")["country_name"].unique()

array(['Burma_Myanmar', 'Russia', 'Egypt', 'Yemen', 'South Yemen',
       'Vietnam_Democratic Republic of', 'Vietnam_Republic of',
       'Korea_North', 'Korea_South', 'Kosovo', 'Taiwan', 'Venezuela',
       'Ivory Coast', 'Cape Verde', 'East Timor', 'Iran', 'Syria',
       'Congo_Democratic Republic of', 'Congo_Republic of the', 'Gambia',
       'Kyrgyzstan', 'Laos', 'Palestine_West_Bank',
       'German Democratic Republic', 'Palestine_Gaza', 'Somaliland',
       'Macedonia', 'Slovakia'], dtype=object)

In [25]:
len(testdf)

253456

### Part g
There are many unmatched rows in this merge. There are three reasons why rows failed to match:
* Differences in geographical coverage: for example, the VDem data includes Taiwan, but the UN data does not
* Differences in time coverage: for example, the UN data includes records for France every year from 1970 through 2014, and VDem includes rows for France from 1960 to 2012, leaving 12 rows for France without matching years
* Differences in spelling: for example, South Korea is called "Korea, Rep." in the UN data and "Korea_South" in the VDem data.

We can't do anything about differences in geographic or temporal coverage. But we can recode some country names to account for differences in spelling and to match more rows that should match. Here is a list of differently spelled countries:

* "Burma_Myanmar" in VDem is "Myanmar" in the UN data
* "Cape Verde" in VDem is "Cabo Verde" in the UN data
* "Congo_Democratic Republic of" in VDem is "Congo, Dem. Rep." in the UN data
* "Congo_Republic of the" in VDem is "Congo, Rep." in the UN data
* "East Timor" in VDem is "Timor-Leste" in the UN data
* "Egypt" in VDem is "Egypt, Arab Rep." in the UN data
* "Gambia" in VDem is "Gambia, The" in the UN data
* "Iran" in VDem is "Iran, Islamic Rep." in the UN data
* "Ivory Coast" in VDem is "Cote d’Ivoire" in the UN data
* "Korea_North" in VDem is "Korea, Dem. People’s Rep." in the UN data
* "Korea_South" in VDem is "Korea, Rep." in the UN data
* "Kyrgyzstan" in VDem is "Kyrgyz Republic" in the UN data
* "Laos" in VDem is "Lao PDR" in the UN data
* "Macedonia" in VDem is "Macedonia, FYR" in the UN data
* "Palestine_West_Bank" in VDem is "West Bank and Gaza" in the UN Data (there is also "Palestine_Gaza" in VDem, but since the UN combines data for the West Bank and Gaza, let's just use "Palestine_West_Bank" for this assignment)
* "Russia" in VDem is "Russian Federation" in the UN data
* "Slovakia" in VDem is "Slovak Republic" in the UN data
* "Syria" in VDem is "Syrian Arab Republic" in the UN data
* "Venezuela" in VDem is "Venezuela, RB" in the UN data
* "Vietnam_Democratic Republic of" in VDem is "Vietnam" in the UN data
* "Yemen" in VDem is "Yemen, Rep." in the UN data

Recode the country names listed above in one of the two dataframes to match the names in the other dataframe. Then perform an inner join of the two dataframes. Some rows will be dropped because of differences in coverage, but no rows will be dropped because of differences in spelling. (2 points)

In [26]:
redo_countries = {
    "Burma_Myanmar": "Myanmar",
    "Cape Verde": "Cabo Verde",
    "Congo_Democratic Republic of": "Congo, Dem. Rep.",
    "Congo_Republic of the": "Congo, Rep.",
    "East Timor": "Timor-Leste",
    "Egypt": "Egypt, Arab Rep.",
    "Gambia": "Gambia, The",
    "Iran": "Iran, Islamic Rep.",
    "Ivory Coast": "Cote d’Ivoire",
    "Korea_North": "Korea, Dem. People’s Rep.",
    "Korea_South": "Korea, Rep.",
    "Kyrgyzstan": "Kyrgyz Republic",
    "Laos": "Lao PDR",
    "Macedonia": "Macedonia, FYR",
    "Palestine_West_Bank": "West Bank and Gaza",
    "Palestine_Gaza": "West Bank and Gaza",
    "Russia": "Russian Federation",
    "Slovakia": "Slovak Republic",
    "Syria": "Syrian Arab Republic",
    "Venezuela": "Venezuela, RB",
    "Vietnam_Democratic Republic of": "Vietnam",
    "Yemen": "Yemen, Rep."
}

In [27]:
vd.country_name = vd.country_name.replace(redo_countries)
vd.country_name.unique()

array(['Mexico', 'Suriname', 'Sweden', 'Switzerland', 'Ghana',
       'South Africa', 'Japan', 'Myanmar', 'Russian Federation',
       'Albania', 'Egypt, Arab Rep.', 'Yemen, Rep.', 'Colombia', 'Poland',
       'Brazil', 'United States', 'Portugal', 'El Salvador',
       'South Yemen', 'Bangladesh', 'Bolivia', 'Haiti', 'Honduras',
       'Mali', 'Pakistan', 'Peru', 'Senegal', 'South Sudan', 'Sudan',
       'Vietnam', 'Vietnam_Republic of', 'Afghanistan', 'Argentina',
       'Ethiopia', 'India', 'Kenya', 'Korea, Dem. People’s Rep.',
       'Korea, Rep.', 'Kosovo', 'Lebanon', 'Nigeria', 'Philippines',
       'Tanzania', 'Taiwan', 'Thailand', 'Uganda', 'Venezuela, RB',
       'Benin', 'Bhutan', 'Burkina Faso', 'Cambodia', 'Indonesia',
       'Mozambique', 'Nepal', 'Nicaragua', 'Niger', 'Zambia', 'Zimbabwe',
       'Guinea', 'Cote d’Ivoire', 'Mauritania', 'Canada', 'Australia',
       'Botswana', 'Burundi', 'Cabo Verde', 'Central African Republic',
       'Chile', 'Costa Rica', 'Timor-Leste

In [28]:
mergedf = pd.merge(ud, vd,
                  on="country_name",
                  how="inner",
                  validate="many_to_many",
                  indicator="matched")

In [29]:
mergedf["matched"].value_counts()

matched
both          280210
left_only          0
right_only         0
Name: count, dtype: int64

In [30]:
mdf = pd.merge(ud, vd,
                  on="country_name",
                  how="inner")

In [31]:
len(mdf)

280210

## Problem 2
[Kickstarter](https://www.kickstarter.com/) is a website in which people can pledge financial support for creative projects. Patrons are only charged if a project raises enough money to meet a pre-specified goal, and projects can offer items as "rewards" for patrons who contribute at particular levels. One interesting aspect of Kickstarter is the ability to [search projects by "ending soon"](https://www.kickstarter.com/discover/advanced?sort=end_date). If you have a few dollars to spare and want to feel like a hero, you can swoop in at the last minute to contribute enough for a project to meet its goal.

Cathie So created a project on Kaggle in which she [scraped Kickstarter](https://www.kaggle.com/socathie/kickstarter-project-statistics/data?select=live.csv) and collected data on 4000 live projects (projects that were currently collecting pledges from patrons) as of October 10, 2016, at 5pm Pacific time. The data are here:

In [32]:
kickstarter = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/live.csv")
kickstarter.head(2).T

Unnamed: 0,0,1
Unnamed: 0,0,1
amt.pledged,15823.0,6859.0
blurb,"\n'Catalysts, Explorers & Secret Keepers: Wome...",\nA unique handmade picture book for kids & ar...
by,Museum of Science Fiction,"Tyrone Wells & Broken Eagle, LLC"
country,US,US
currency,usd,usd
end.time,2016-11-01T23:59:00-04:00,2016-11-25T01:13:33-05:00
location,"Washington, DC","Portland, OR"
percentage.funded,186,8
state,DC,OR


### Part a
Notice that the `end.time` column, the date and time at which the project stops accepting pledges, is formatted as follows:
```
2016-11-01T23:59:00-04:00
```
This formatting is "YYYY-MM-DDThh:mm:ss-TZD": four digits for the year, a dash, two digits for the month, another dash, and two digits for the day; the "T" separates the dates from the time; two digits for the hour, minute and second, separated by colons; and the time zone expressed as hours difference from Greenwich mean time (also called UTC), and -04:00 is four hours earlier than UTC, for example.

But `end.time` is also currently read as a string, with `object` data type:

In [33]:
kickstarter.dtypes

Unnamed: 0             int64
amt.pledged          float64
blurb                 object
by                    object
country               object
currency              object
end.time              object
location              object
percentage.funded      int64
state                 object
title                 object
type                  object
url                   object
dtype: object

Convert `end.time` to a timestamp, and extract the month, day, year, hour, minute, and second of the end time. To allow the `pd.to_datetime()` function to read timezones, use the `utc=True` argument. (2 points)

In [34]:
kickstarter["end.time"] = pd.to_datetime(kickstarter["end.time"], utc=True).dt.strftime("%m-%d-%Y %H:%M:%S")
kickstarter["end.time"] = pd.to_datetime(kickstarter["end.time"])
kickstarter.dtypes

Unnamed: 0                    int64
amt.pledged                 float64
blurb                        object
by                           object
country                      object
currency                     object
end.time             datetime64[ns]
location                     object
percentage.funded             int64
state                        object
title                        object
type                         object
url                          object
dtype: object

In [35]:
kickstarter["end.time"]

0      2016-11-02 03:59:00
1      2016-11-25 06:13:33
2      2016-11-24 04:00:00
3      2016-11-02 03:50:00
4      2016-11-19 04:05:48
               ...        
3995   2016-11-20 06:10:00
3996   2016-11-15 21:00:00
3997   2016-10-30 13:36:06
3998   2016-11-17 17:11:26
3999   2016-12-11 05:11:01
Name: end.time, Length: 4000, dtype: datetime64[ns]

### Part b
Create a dataframe with one row for every ending day in the `kickstarter` data that reports the average amount pledged (`amt.pledged`) on each day. Sort the rows in descending order by average amount pledged, and display the five days with the highest averages. (2 points)

In [36]:
dfb = kickstarter.groupby(pd.Grouper(key='end.time', freq='D'))\
.agg({"amt.pledged":"mean"}).sort_values(by="amt.pledged", ascending=False)

In [37]:
dfb.head(5)

Unnamed: 0_level_0,amt.pledged
end.time,Unnamed: 1_level_1
2016-12-14,47938.375
2016-11-04,26975.388889
2016-11-11,24990.669065
2016-12-17,22160.230769
2016-11-18,21016.234043


### Part c
Display the text of the longest `blurb` in the data. (2 points)

In [39]:
kickstarter['len'] = kickstarter.blurb.str.len()
pd.options.display.max_colwidth = None
kickstarter.sort_values(by = 'len', ascending = False).blurb.head(1)

2413    \nWe are charismatic anti-rock band hailing from Winnipeg, Manitoba and we are determined to release a debut album by the summer of 2017!\n
Name: blurb, dtype: object

### Part d
How many blurbs for projects with end dates between November 15, 2016 and December 7, 2016 contain the phrase "science fiction"? [Hint: Don't forget to make this search case-insensitive and to sort the `kickstarter` dataframe by `end.time` before setting `end.time` as the index.] (2 points)

### answer
5

In [40]:
# kickstarter.head(2).T
scifi = kickstarter[kickstarter["blurb"].str.lower().str.contains("science fiction")]
scifi.head().T

Unnamed: 0,0,214,576,604,991
Unnamed: 0,0,214,576,604,991
amt.pledged,15823.0,5781.0,21364.0,5299.0,3642.0
blurb,"\n'Catalysts, Explorers & Secret Keepers: Women of Science Fiction' is a take-home exhibit & anthology by the Museum of Science Fiction.\n",\nLegendary science fiction authors and the makers of Takamo Universe have joined forces to transform the game universe into written word\n,"\nSpruitje makes futuristic designs with light and living ecosystems. Inspired by 70's design, science fiction movies but mainly nature.\n",\nA Science Fiction film filled with entertainment and Excitement\n#sciencefiction #popularity #magic\n,"\nLatest Edition of Science Fiction Gaming Maps from The Adventurers Atlas. 8 New Maps, including massive desert city \nspaceport.\n"
by,Museum of Science Fiction,Randy Ritnour,Spruitje,Chris,Matt Francella
country,US,US,NL,US,US
currency,usd,usd,eur,usd,usd
end.time,2016-11-02 03:59:00,2016-11-29 01:00:00,2016-11-18 07:31:01,2016-12-07 03:25:01,2016-10-30 17:54:26
location,"Washington, DC","Lincoln, NE","Amsterdam, Netherlands","Chicago, IL","Hampton, VA"
percentage.funded,186,165,28,529,72
state,DC,NE,North Holland,IL,VA


In [41]:
scifi.columns = ['Unnamed: 0', 'amt.pledged', 'blurb', 'by', 'country', 'currency',
       'end_time', 'location', 'percentage.funded', 'state', 'title', 'type',
       'url', 'len']

In [42]:
scifi.end_time.value_counts()

end_time
2016-11-02 03:59:00    1
2016-11-29 01:00:00    1
2016-11-18 07:31:01    1
2016-12-07 03:25:01    1
2016-10-30 17:54:26    1
2016-12-23 04:19:54    1
2016-10-30 17:21:58    1
2016-11-06 21:18:56    1
2016-11-12 00:59:09    1
2016-11-30 22:00:00    1
2016-10-31 22:59:00    1
2016-11-18 06:15:00    1
2016-11-17 19:57:17    1
Name: count, dtype: int64

In [43]:
len(scifi)

13

In [44]:
proj = scifi.query("'2016-11-15' <= end_time <= '2016-12-07'")

In [45]:
proj.end_time.sort_values()

3406   2016-11-17 19:57:17
3386   2016-11-18 06:15:00
576    2016-11-18 07:31:01
214    2016-11-29 01:00:00
2500   2016-11-30 22:00:00
Name: end_time, dtype: datetime64[ns]

In [46]:
len(proj)

5