# Data Manipulation with Python #  

This notebook is not a comprehensive introduction to Python or a tutorial; it is a demonstration of how Python can be used to work with data. Some resources for learning Python:  
- [Online courses through BPL](https://www.bpl.org/learning-tools/online-courses/)
- [Python website](https://www.python.org/)
- [Codecademy](https://www.codecademy.com/)  

In this notebook, we'll run through a quick example of working with CSV data using Python. We'll use the Boston [2018 ACS dataset](https://data.census.gov/cedsci/map?g=0500000US25025.140000_1600000US2507000&hidePreview=false&layer=place&tid=ACSST5Y2018.S0101&table=DP02&tp=false&vintage=2018&cid=S0101_C01_001E) from census.gov. Given data on inhabitants' ages for each census tract, let's try to find the most populous age group in each tract.  

Below you will find three sections:
- [Functions and Methods](#Functions-and-Methods) includes details for the syntax of the different tools we'll be using.
- [Cleaning Data](#Cleaning-Data) shows Python usage for reformatting data and making it easier to work with.
- [Selecting Data](#Selecting-Data) includes both 
    1. pre-written lines for finding the most populous age group within each tract and 
    2. lines to be filled in (following the same pattern) to find where the median age of a tract is under 50.

## Functions and Methods ##

[```pandas.read_csv(filepath)```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) Read a comma-separated values (CSV) file into a DataFrame.   

[```pandas.DataFrame(data)```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.  

```iloc[]``` Purely integer-location based indexing for selection by position.  
```columns``` The column labels of the DataFrame.  
```max``` Return the maximum of the values over the requested axis.  
```idxmax``` Return the index of the first occurrence of the maximum over the requested axis.
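
As a quick illustration of the four tools above, here is a minimal sketch on a small made-up table. The tract names and counts are hypothetical and are not taken from the ACS dataset.

```python
import pandas as pd

# Hypothetical toy data: population counts for two age groups in three tracts.
toy = pd.DataFrame(
    {"Under 5 years": [120, 80, 95], "5 to 9 years": [100, 130, 95]},
    index=["Tract A", "Tract B", "Tract C"],
)

first_row = toy.iloc[0]          # first row by integer position, as a Series
labels = list(toy.columns)       # the column labels
row_max = toy.max(axis=1)        # largest value in each row
row_argmax = toy.idxmax(axis=1)  # column label holding each row's maximum

print(first_row)
print(labels)
print(row_max)
print(row_argmax)
```

Note that for Tract C, where both columns tie, ```idxmax``` reports the first column that holds the maximum.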


## Cleaning Data ##  
  
Let's start by importing the pandas library. We'll be using the DataFrame object to store and manipulate our data. For more information check out the [user guide](https://pandas.pydata.org/docs/user_guide/index.html)!

In [None]:
import pandas as pd

Now we need to read in our data. Since it's in CSV form, we can use pandas' ```read_csv``` function to create a DataFrame. Pandas also provides functions for reading other common formats, such as JSON and Excel files.

In [None]:
df = pd.read_csv("data/ACSST5Y2018.S0101_data_with_overlays_2020-03-05T114122.csv")
df

You may have noticed that the column names don't really tell us anything about what's stored in them. The first row, however, contains a description for each column. To access it, we can select the row by its integer position using ```iloc```.

In [None]:
df.iloc[0]

With this row selected, we can clean up the names using a Python list comprehension that replaces each "!!" separator with a space.

In [None]:
new_columns = [name.replace("!!", " ") for name in df.iloc[0]]

Set the column names equal to the new ones we just created. 

In [None]:
df.columns = new_columns

Drop the description row by setting ```df``` equal to the rest of the DataFrame.

In [None]:
df = df.iloc[1:]

In [None]:
df
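
The cleaning steps above can be tried end to end on a tiny in-memory CSV. This is only a sketch: the codes, descriptions, and counts below are hypothetical stand-ins for the real file, and the ```reset_index``` call is an optional extra that is not part of the steps above.

```python
import io
import pandas as pd

# Hypothetical CSV text: line 1 holds column codes, line 2 holds
# "!!"-separated descriptions, and the remaining lines hold the data.
raw = io.StringIO(
    "CODE_0,CODE_1,CODE_2\n"
    "id,Estimate!!Total!!Total population,"
    "Estimate!!Total!!Total population!!AGE!!Under 5 years\n"
    "tract_1,4000,250\n"
    "tract_2,3500,180\n"
)

toy = pd.read_csv(raw)

# Build readable names from the description row, then drop that row.
toy.columns = [name.replace("!!", " ") for name in toy.iloc[0]]
toy = toy.iloc[1:].reset_index(drop=True)  # optional: renumber rows from 0

print(toy.columns.tolist())
print(toy)
```

Because the description row was read as data, the remaining values are stored as text rather than numbers, which matters when comparing them later.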

## Selecting Data ##

In [None]:
cols = []
for column in df.columns:
    #print(column)
    if 'estimate total total population age' in column.lower():
        print(column)
        cols.append(column)

In [None]:
for column in df.columns:
    print(column)

# Uncomment the line below and set median equal to the column name for the total population's median age
#median = ''

In [None]:
age = df[cols]

In [None]:
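# Exercise: complete this line to select the median age column, following the form of the cell above
median_age = 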

In [None]:
age

In [None]:
median_age

In [None]:
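# The values were read in as text, so convert to numbers before taking the row maximum
print(age.astype(float).max(axis=1))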

In [None]:
groups = age.astype(float).idxmax(axis=1)
groups
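
The conversion with ```astype(float)``` matters because the counts are stored as text after the description row is removed. The sketch below, on hypothetical counts, shows how text comparison differs from numeric comparison and what ```max``` and ```idxmax``` return once the values are numbers.

```python
import pandas as pd

# Hypothetical counts stored as text, as they would be after reading a CSV
# whose first data row held descriptions.
toy = pd.DataFrame(
    {"AGE 20 to 24 years": ["90", "1200"], "AGE 25 to 29 years": ["800", "300"]},
    index=["Tract A", "Tract B"],
)

# Plain Python shows the pitfall: text is compared character by character,
# so "90" counts as larger than "800" because "9" sorts after "8".
text_max = max("90", "800")

numeric = toy.astype(float)             # convert every column to numbers
numeric_max = numeric.max(axis=1)       # largest count in each tract
largest_group = numeric.idxmax(axis=1)  # column label of that largest count

print(text_max)
print(numeric_max)
print(largest_group)
```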

In [None]:
twenties = ['20' in group or '25' in group for group in groups]
print(twenties)
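
A list of True/False values like this one can be counted or used to select rows. Here is a minimal sketch with a hypothetical ```groups``` Series; the ```str.contains``` line is an alternative way to build the same mask and is an addition, not part of the cell above.

```python
import pandas as pd

# Hypothetical result of idxmax: the largest age group for each tract.
groups = pd.Series(
    ["AGE 20 to 24 years", "AGE 65 to 69 years", "AGE 25 to 29 years"],
    index=["Tract A", "Tract B", "Tract C"],
)

# Same form as the cell above: True where the label mentions 20 or 25.
twenties = ["20" in group or "25" in group for group in groups]

count = sum(twenties)         # True counts as 1, so this is the number of matches
selected = groups[twenties]   # keep only the tracts where the mask is True

# Alternative: build the mask with pandas string methods and a regex pattern.
mask_alt = groups.str.contains("20 to 24|25 to 29")

print(count)
print(selected)
```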


In [None]:
# Find where median age is less than 50
less_than_fifty = 