# Day 4: Session A - Data Frames

[Link to session webpage](https://eds-217-essential-python.github.io/course-materials/interactive-sessions/4a_dataframes.html)


In [1]:
import pandas as pd
import numpy as np

## Import Data

In [2]:
url = "https://raw.githubusercontent.com/datasets/world-cities/master/data/world-cities.csv"
cities_df = pd.read_csv(url)

## View the data

In [6]:
# head() and tail() are methods (functions attached to the data frame), hence the ()
print(cities_df.head())
print(cities_df.tail())

                  name               country             subcountry  geonameid
0         les Escaldes               Andorra     Escaldes-Engordany    3040051
1     Andorra la Vella               Andorra       Andorra la Vella    3041563
2   Umm Al Quwain City  United Arab Emirates  Imārat Umm al Qaywayn     290594
3  Ras Al Khaimah City  United Arab Emirates        Raʼs al Khaymah     291074
4           Zayed City  United Arab Emirates              Abu Dhabi     291580
              name   country           subcountry  geonameid
26462     Bulawayo  Zimbabwe             Bulawayo     894701
26463      Bindura  Zimbabwe  Mashonaland Central     895061
26464   Beitbridge  Zimbabwe   Matabeleland South     895269
26465      Epworth  Zimbabwe               Harare    1085510
26466  Chitungwiza  Zimbabwe               Harare    1106542


### Data frame properties

In [13]:
# Number of rows and columns, returned as a (rows, columns) tuple
# shape is an attribute (stored information, no parentheses);
# head() and tail() are methods, so they need parentheses.
print("Shape:", cities_df.shape)

# Column names
# Also an attribute: it stores the column labels of the data frame.
print("\nColumns:", cities_df.columns)

# Data type of each column (another attribute)
print("\nData types:\n", cities_df.dtypes)

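# Summary statistics of numeric columns (if any)
# describe() is a method, not an attribute, so it needs parentheses
print("\nSummary statistics:\n", cities_df.describe())

# Detailed info about column types and content
# Note: info() prints its report itself and returns None, so wrapping it in
# print() also prints "None" at the end. Calling cities_df.info() alone avoids that.
print("\nInfo:\n", cities_df.info())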


Shape: (26467, 4)

Columns: Index(['name', 'country', 'subcountry', 'geonameid'], dtype='object')

Data types:
 name          object
country       object
subcountry    object
geonameid      int64
dtype: object

Summary statistics:
           geonameid
count  2.646700e+04
mean   2.858410e+06
std    2.167506e+06
min    1.057000e+04
25%    1.274182e+06
50%    2.524907e+06
75%    3.589464e+06
max    1.254173e+07
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26467 entries, 0 to 26466
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        26467 non-null  object
 1   country     26467 non-null  object
 2   subcountry  26439 non-null  object
 3   geonameid   26467 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 827.2+ KB

Info:
 None
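
The attribute/method distinction in the comments above can be checked on a small example. This is only a sketch: `toy_df` and its made-up population values are hypothetical stand-ins, not part of the world-cities data.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy_df = pd.DataFrame({
    "name": ["Goleta", "Ojai", "Lompoc"],
    "population": [32000, 7500, 44000],
})

# Attributes: stored information, accessed without parentheses
shape = toy_df.shape                  # tuple of (rows, columns)
column_names = list(toy_df.columns)   # column labels as a plain list

# Methods: functions attached to the data frame, called with parentheses
first_two = toy_df.head(2)

# info() prints its report directly and returns None
info_result = toy_df.info()

print(shape)
print(column_names)
print(len(first_two))
print(info_result)
```

Because `info()` returns `None`, there is nothing useful to pass to `print()`; the report has already been written out by the time the call returns.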


### Check for missing values

In [15]:
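# isnull() marks each missing cell as True; sum() then adds those up down each
# column (axis=0, the default), giving one missing-value count per column.
# Summing down columns is the usual case; use axis=1 to sum across each row instead.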
print(cities_df.isnull().sum())

name           0
country        0
subcountry    28
geonameid      0
dtype: int64
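
To see what the default axis does, here is a minimal sketch on a hypothetical three-row frame (`toy_df` is made up for illustration). It compares the per-column count used above with a per-row count and a grand total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing subcountry values
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "subcountry": ["X", np.nan, np.nan],
})

# Default (axis=0): add down each column, one count per column
missing_per_column = toy_df.isnull().sum()

# axis=1: add across each row, one count per row
missing_per_row = toy_df.isnull().sum(axis=1)

# Summing twice collapses everything to a single total
total_missing = toy_df.isnull().sum().sum()

print(missing_per_column)
print(missing_per_row)
print(total_missing)
```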


## Step 2: Cleaning Data

To remove rows with missing data, use `dropna()`.

In [16]:
# Only 'subcountry' has missing values; subset= limits the check to that column,
# so rows are dropped only when 'subcountry' itself is missing.
cities_df = cities_df.dropna(subset=['subcountry'])

In [17]:
print(cities_df.isnull().sum())

name          0
country       0
subcountry    0
geonameid     0
dtype: int64
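
The effect of `subset=` is easier to see when more than one column has gaps. A small sketch, assuming a hypothetical frame `toy_df` with an extra `note` column that is not in the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: missing values in two different columns
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "subcountry": ["X", np.nan, "Z"],
    "note": [np.nan, "b", "c"],
})

# No subset: drop a row if ANY column is missing
dropped_any = toy_df.dropna()

# With subset: drop a row only if 'subcountry' is missing
dropped_subset = toy_df.dropna(subset=["subcountry"])

print(len(dropped_any))
print(len(dropped_subset))
print(dropped_subset.index.tolist())
```

Note that `dropna()` keeps the original row labels, so the index of the result can have gaps. That is also why the filtered outputs in this notebook show labels like 24818 rather than starting from 0.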


## Basic Selecting and Filtering

A data frame is a collection of Series, one per column. All of the Series share the same row index, and each one is identified by its own column label.


In [21]:
# Select a single column
# Easy! Put the column name in square brackets after the data frame
print(cities_df['name'].head()) # only using head to keep notebook clean
# requesting a list of columns out of a df always makes a new df
# requesting just one column makes a series


# Select multiple columns
print(cities_df[['name', 'country', 'subcountry']].head())
# Pass a list of column names (hence the double brackets) to get multiple columns

0           les Escaldes
1       Andorra la Vella
2     Umm Al Quwain City
3    Ras Al Khaimah City
4             Zayed City
Name: name, dtype: object
                  name               country             subcountry
0         les Escaldes               Andorra     Escaldes-Engordany
1     Andorra la Vella               Andorra       Andorra la Vella
2   Umm Al Quwain City  United Arab Emirates  Imārat Umm al Qaywayn
3  Ras Al Khaimah City  United Arab Emirates        Raʼs al Khaymah
4           Zayed City  United Arab Emirates              Abu Dhabi


To make a series from a column: `df['column']`

To make a data frame from a column, request it as a single-item list: `df[ ['column'] ]`
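
A quick way to confirm the difference between the two forms is to check the type and shape of each result. This sketch uses a hypothetical two-row `toy_df`:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"name": ["A", "B"], "country": ["X", "Y"]})

as_series = toy_df["name"]     # single label -> Series
as_frame = toy_df[["name"]]    # single-item list -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series is one-dimensional, while the one-column data frame still has both a row and a column dimension.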

## Filtering

In [22]:
# Cities in the United States
us_cities = cities_df[cities_df['country'] == 'United States']
# Keeps only the rows where the 'country' column equals 'United States'
print(us_cities[['name', 'country']].head())

You can combine conditions with logical operators to filter on multiple columns.

In [27]:
# Cities in California
# the long way
in_us = cities_df['country'] == 'United States'
in_ca = cities_df['subcountry'] == 'California'
california_cities = cities_df[ in_us & in_ca ]
california_cities.head()

               name        country  subcountry  geonameid
24818      Fillmore  United States  California    5284756
24867      Adelanto  United States  California    5322400
24868        Agoura  United States  California    5322551
24869  Agoura Hills  United States  California    5322553
24870       Alameda  United States  California    5322737


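`&` is an operator that pandas overloads to compare two Boolean Series element by element, returning a new Boolean Series.

`and` is a Python keyword that works on single truth values, so it doesn't know how to combine two Series row by row.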
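
A small sketch of the difference, using a hypothetical three-row `toy_df`. The `|` (or) and `~` (not) operators are included because they follow the same element-wise pattern as `&`.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "name": ["Fillmore", "Austin", "Toronto"],
    "country": ["United States", "United States", "Canada"],
    "subcountry": ["California", "Texas", "Ontario"],
})

in_us = toy_df["country"] == "United States"
in_ca = toy_df["subcountry"] == "California"

# Element-wise operators return a Boolean Series, one value per row
both = in_us & in_ca      # True where both conditions hold
either = in_us | in_ca    # True where at least one holds
not_us = ~in_us           # flips True and False

# Python's `and` asks for one truth value for the whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    in_us and in_ca
    and_failed = False
except ValueError:
    and_failed = True

print(both.tolist())
print(either.tolist())
print(not_us.tolist())
print(and_failed)
```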

In [29]:
# the short way
# Wrap each condition in ( ): & binds more tightly than ==,
# so without parentheses Python would evaluate the & first
california_cities = cities_df[
    (cities_df['country'] == 'United States') & 
    (cities_df['subcountry'] == 'California')
]
print(california_cities[['name', 'country', 'subcountry']].head())

               name        country  subcountry
24818      Fillmore  United States  California
24867      Adelanto  United States  California
24868        Agoura  United States  California
24869  Agoura Hills  United States  California
24870       Alameda  United States  California


## Step 5: Sorting and Ranking

## Step 6: Transformations

## Step 7: Grouping and Aggregation

`groupby`

Takes a column we know to be a category and groups the rows by that category.
pandas then takes that mapping, goes to the column you pick, and applies a function to each group,
e.g., finding the mean weight of dogs grouped by type.

Grouping things and getting aggregate information about each category
is about 50% of what you're doing in descriptive statistics.
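
The dog example can be sketched as follows. The `dogs_df` frame, its column names, and its weights are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (weights are invented)
dogs_df = pd.DataFrame({
    "type": ["terrier", "terrier", "hound", "hound"],
    "weight_kg": [6.0, 8.0, 20.0, 30.0],
})

# Group rows by 'type', pick the 'weight_kg' column, take the mean per group
mean_weight = dogs_df.groupby("type")["weight_kg"].mean()

print(mean_weight)
```

The result is a Series indexed by the group labels, so `mean_weight["hound"]` looks up the value for one category.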

In [33]:
cities_per_country = cities_df.groupby('country')

cities_per_country['name'].count()


country
Afghanistan           49
Aland Islands          1
Albania               21
Algeria              250
American Samoa         1
                    ... 
Vietnam              116
Wallis and Futuna      1
Yemen                 23
Zambia                29
Zimbabwe              28
Name: name, Length: 225, dtype: int64

In [30]:
# Number of cities by country
# groupby() takes the name of the column you want to group by

cities_per_country = cities_df.groupby('country')['name'].count().sort_values(ascending=False)
print(cities_per_country.head())

# Number of subcountries (e.g., states, provinces) by country
subcountries_per_country = cities_df.groupby('country')['subcountry'].nunique().sort_values(ascending=False)
print(subcountries_per_country.head())

country
United States    3273
India            2480
China            1955
Brazil           1217
Germany          1117
Name: name, dtype: int64
country
Russia      83
Turkey      81
Thailand    75
Vietnam     62
Algeria     53
Name: subcountry, dtype: int64
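
The two aggregations above answer different questions: `count()` tallies non-missing entries in each group, while `nunique()` tallies distinct values. A minimal sketch on a hypothetical four-row `toy_df`:

```python
import pandas as pd

# Hypothetical toy data: country A has three cities in two subcountries
toy_df = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "name": ["a1", "a2", "a3", "b1"],
    "subcountry": ["north", "north", "south", "east"],
})

# Number of cities per country, largest first
cities = toy_df.groupby("country")["name"].count().sort_values(ascending=False)

# Number of distinct subcountries per country, largest first
subcountries = (
    toy_df.groupby("country")["subcountry"]
    .nunique()
    .sort_values(ascending=False)
)

print(cities)
print(subcountries)
```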
