# About

This module runs through the basics of creating columns and sorting data.

In [1]:
# import our tools
import pandas as pd
import numpy as np

In [2]:
# bring in the college dataset
ipeds_url = "https://public.tableau.com/s/sites/default/files/media/Resources/IPEDS_data.xlsx"
ipeds = pd.read_excel(ipeds_url)
ipeds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 0 to 1533
Columns: 145 entries, ID number to Endowment assets (year end) per FTE enrollment (FASB)
dtypes: float64(116), int64(2), object(27)
memory usage: 1.7+ MB


# Creating Columns



In [3]:
# first, what do we really have for columns
list(ipeds.columns)

['ID number',
 'Name',
 'year',
 'ZIP code',
 'Highest degree offered',
 'County name',
 'Longitude location of institution',
 'Latitude location of institution',
 'Religious affiliation',
 'Offers Less than one year certificate',
 'Offers One but less than two years certificate',
 "Offers Associate's degree",
 'Offers Two but less than 4 years certificate',
 "Offers Bachelor's degree",
 'Offers Postbaccalaureate certificate',
 "Offers Master's degree",
 "Offers Post-master's certificate",
 "Offers Doctor's degree - research/scholarship",
 "Offers Doctor's degree - professional practice",
 "Offers Doctor's degree - other",
 'Offers Other degree',
 'Applicants total',
 'Admissions total',
 'Enrolled total',
 'Percent of freshmen submitting SAT scores',
 'Percent of freshmen submitting ACT scores',
 'SAT Critical Reading 25th percentile score',
 'SAT Critical Reading 75th percentile score',
 'SAT Math 25th percentile score',
 'SAT Math 75th percentile score',
 'SAT Writing 25th percentil

In [0]:
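# let's make this a much smaller dataset
COLS_2_KEEP = ['ID number', 'Name', 'Applicants total', 'Admissions total', 'Enrolled total']
# .copy() gives us an independent dataframe, so later edits don't trigger SettingWithCopyWarning
ipeds2 = ipeds.loc[:, COLS_2_KEEP].copy()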
ipeds2.head()

In [0]:
# that's much easier - what do we have for missing data
ipeds2.isna().sum()

In [0]:
# remove every row that has at least 1 missing value 
ipeds2.dropna(inplace=True)

In [0]:
# confirm
ipeds2.isna().sum()

In [0]:
# create a simple column, every value is one
ipeds2['just1'] = 1
ipeds2.head()

In [0]:
# summary
ipeds2.describe()



---



***What just happened***

Pandas broadcasts the calculation to every row in our dataset, so there is no need to handle each row one by one.  This is the equivalent of dragging a formula down every row in Excel.
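
To make the broadcasting idea concrete, here is a minimal sketch on a made-up three-row dataframe. The `toy` dataframe and its values are assumptions for illustration only; the column names simply mirror the ones used above.

```python
import pandas as pd

# Toy data (hypothetical values); column names mirror this notebook
toy = pd.DataFrame({
    "Name": ["A College", "B University", "C Institute"],
    "Applicants total": [100, 250, 40],
})

# A single scalar on the right-hand side is repeated for every row
toy["just1"] = 1

# Arithmetic on a column is applied element-wise, one result per row
toy["double_apps"] = toy["Applicants total"] * 2

print(toy)
```

Neither assignment needs a loop: pandas lines the values up by row and fills the new column in one step.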




---



In [0]:
# we can create more than constants
ipeds2['double_apps'] = ipeds2['Applicants total'] * 2
ipeds2.head()

In [0]:
# we can even compare columns
ipeds2['yield_rate'] = ipeds2['Enrolled total'] / ipeds2['Admissions total']
ipeds2.head()

# Sorting

Remember `value_counts()`? Well, there is also `sort_values()`.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

^^ Documentation is our friend

In [0]:
# get help on the DataFrame method sort_values
pd.DataFrame.sort_values?

In [0]:
## sort the data by double_apps ascending (smallest applicant pools first)
ipeds2.sort_values('double_apps').head()

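> Note that even though we removed all missing values, we still get a NaN in `yield_rate`. When both the enrolled and admitted counts are 0, the result of 0 / 0 is undefined, so pandas stores NaN (a nonzero number divided by 0 gives `inf` instead). It's pretty great that pandas didn't raise an error, right?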
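
A small sketch of that behavior, using made-up counts rather than the IPEDS data. The last step, replacing infinite values with NaN, is an optional cleanup that this notebook does not perform; it is shown only as one possible way to treat both cases as missing.

```python
import numpy as np
import pandas as pd

# Toy counts (hypothetical values) covering the three cases
toy = pd.DataFrame({
    "Enrolled total": [50, 0, 10],
    "Admissions total": [100, 0, 0],
})

# Element-wise division: no exception is raised for the zero denominators
rates = toy["Enrolled total"] / toy["Admissions total"]
print(rates)  # 50/100 is a normal ratio, 0/0 is NaN, 10/0 is inf

# Optional cleanup (an assumption, not part of the lesson): treat inf as missing
cleaned = rates.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().sum())
```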

In [0]:
## same sort by double_apps, but view the bottom rows (largest applicant pools)
ipeds2.sort_values('double_apps').tail()

In [0]:
## we can also do nested sorting
ipeds2.sort_values(['yield_rate', 'double_apps']).tail()
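
As a rough illustration of how nested sorting orders rows, the sketch below uses a made-up four-row dataframe (the values are assumptions, not IPEDS data). It also shows two `sort_values` parameters, `ascending` as a list and `na_position`, that the cells above do not use.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with a tie and a missing yield_rate
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "yield_rate": [0.5, 0.25, 0.5, np.nan],
    "double_apps": [200, 80, 500, 0],
})

# Sort by yield_rate first; rows that tie on yield_rate are ordered by double_apps
nested = toy.sort_values(["yield_rate", "double_apps"])

# ascending accepts one flag per column; NaN rows still go last by default
mixed = toy.sort_values(["yield_rate", "double_apps"], ascending=[False, True])

# na_position="first" moves rows with a missing sort key to the top
na_first = toy.sort_values("yield_rate", na_position="first")

print(nested)
print(mixed)
print(na_first)
```

Because missing values land at the end by default, `.tail()` on a sorted dataframe may show NaN rows rather than the largest values.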

In [0]:
## these were just temporary sorts, we can use inplace to save
ipeds2.head()

In [0]:
ipeds2.sort_values(['yield_rate', 'double_apps'], inplace=True)

In [0]:
ipeds2.tail()

> `inplace=True` is like assigning the output of the sort back to the same dataframe (`ipeds2 = ipeds2.sort_values(...)`); pandas just does it for us, in place, and the call itself returns `None`.
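
The difference between a temporary sort, reassignment, and `inplace=True` can be sketched on toy data. The dataframes `toy` and `other` and their values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B", "C"], "double_apps": [500, 80, 200]})

# Without inplace, sort_values returns a new sorted dataframe; toy keeps its order
sorted_copy = toy.sort_values("double_apps")
original_order = toy["Name"].tolist()

# Reassigning the result has the same effect as sorting in place
toy = toy.sort_values("double_apps")

# With inplace=True the dataframe itself is reordered and the call returns None
other = pd.DataFrame({"double_apps": [3, 1, 2]})
result = other.sort_values("double_apps", inplace=True)

print(original_order, toy["Name"].tolist(), result)
```

One practical consequence: writing `df = df.sort_values(..., inplace=True)` would overwrite `df` with `None`, so use either reassignment or `inplace=True`, not both.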

# Combine this lesson together

In [0]:
# create a column that is the length of the school's name
ipeds2['name_length'] = ipeds2['Name'].str.len()

In [0]:
ipeds2.head()

> Note:  I corrected this in a previous notebook, but notice that string methods such as length, lower, etc. are not called directly on the Series. We reach them through the `.str` accessor on a Series that holds strings, as in `ipeds2['Name'].str.len()`.
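
Here is a minimal sketch of the `.str` accessor on a made-up Series of names (the names are hypothetical, not taken from the dataset). Besides `len()`, it shows `lower()` and `contains()` as two other string methods available the same way.

```python
import pandas as pd

# Toy names (hypothetical), including one missing value
names = pd.Series(["Alpha College", "Beta State University", None])

# Each method is applied to every element; missing values stay missing
name_length = names.str.len()
lower = names.str.lower()

# na=False makes missing names count as "no match" instead of NaN
has_univ = names.str.contains("University", na=False)

print(name_length)
print(lower)
print(has_univ)
```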

In [0]:
## sort the dataset descending (it's just an argument) in place and view the 5 longest names
ipeds2.sort_values("name_length", ascending=False, inplace=True)
ipeds2.head()