# Intro to Pandas

Pandas is a Python library that is similar to the `datascience` module you have been using in DSC10.  Frankly, I find the official documentation ([here](https://pandas.pydata.org/)) kind of confusing, and I mostly use Stack Overflow to figure out how to do things.  Here we will go over some of the basic ways to manipulate tabular data with Pandas.

[This video](https://vimeo.com/59324550) is also a good introduction, and it is paired with a Jupyter notebook ([here](http://nbviewer.jupyter.org/urls/gist.github.com/wesm/4757075/raw/a72d3450ad4924d0e74fb57c9f62d1d895ea4574/PandasTour.ipynb)) that covers things slightly differently.

This notebook is adapted from Dennis Tenin's [Lede Program](https://github.com/ledeprogram/courses/blob/master/README.md).

(FYI - more good basketball data is at [538](http://eightthirtyfour.com/data))

In [None]:
import pandas as pd

In [None]:
nba_df = pd.read_csv("NBA 2013.csv")

In [None]:
# Look at the first 10 rows
nba_df.head(10)

In [None]:
# Find out how many players are in each position
nba_df["POS"].value_counts()

In [None]:
# Get all of the people who match a certain characteristic
nba_df[nba_df["POS"] == "F"].head()

In [None]:
# Get all of the people who match several characteristics at once (AND)
nba_df[(nba_df["POS"] == "F") & (nba_df["HS Only"] == "No") ].head()

In [None]:
# Get all of the people who match any one of several characteristics (OR)
nba_df[(nba_df["POS"] == "F") | (nba_df["POS"] == "G") ].head()
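
The filters above all work the same way: each comparison builds a column of True/False values (a "mask"), and the dataframe keeps the rows where the mask is True. Here is a minimal sketch of that on a made-up table. The `toy_df` name and its rows are hypothetical (only the column names are borrowed from the NBA data), and `.isin()` is shown as one possible shortcut for several `|` conditions.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the NBA table
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "POS": ["F", "G", "C", "F"],
    "HS Only": ["No", "No", "Yes", "Yes"],
})

# Each comparison produces a boolean Series (a "mask")
is_forward = toy_df["POS"] == "F"
went_to_college = toy_df["HS Only"] == "No"

# & means AND, | means OR; inline comparisons need their own parentheses
forwards_college = toy_df[is_forward & went_to_college]

# .isin() is a compact alternative to chaining several | conditions
forwards_or_guards = toy_df[toy_df["POS"].isin(["F", "G"])]

print(forwards_college["Name"].tolist())
print(forwards_or_guards["Name"].tolist())
```

Naming the masks first is optional, but it tends to make long filters easier to read than one big bracketed expression.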

In [None]:
# Retrieve what's nan/null/etc
nba_df[pd.isnull(nba_df["Race"])].head()

In [None]:
# Retrieve what's NOT nan/null/etc
nba_df[~pd.isnull(nba_df["Race"])].head()

In [None]:
# Or, equivalently, use pd.notnull instead of negating pd.isnull
nba_df[pd.notnull(nba_df["Race"])].head()
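
To see what the null checks are doing, it can help to try them on a tiny table where you know which values are missing. This is a rough sketch with hypothetical data (`toy_df` and its rows are made up). It uses the `.isnull()` and `.notnull()` methods on the column, which should behave the same as `pd.isnull` and `pd.notnull` above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (NaN and None both count)
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Race": ["White", np.nan, None],
})

# Rows where Race is missing
missing = toy_df[toy_df["Race"].isnull()]

# Rows where Race is present
present = toy_df[toy_df["Race"].notnull()]

# True counts as 1, so summing the mask counts the missing values
n_missing = toy_df["Race"].isnull().sum()

print(missing["Name"].tolist())
print(present["Name"].tolist())
print(n_missing)
```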

In [None]:
# Get numerical data on a column
# If you're dealing with labels or groups, use .value_counts()
nba_df["Age"].describe()

In [None]:
# Get numerical data on grouped data
nba_df.groupby("POS")["Age"].describe()
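
If `.describe()` gives you more than you need, you can ask for specific statistics per group instead. A small sketch, assuming a made-up table (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the NBA table
toy_df = pd.DataFrame({
    "POS": ["F", "F", "G", "G", "C"],
    "Age": [22, 30, 25, 27, 31],
})

# One statistic per group: the result is a Series indexed by POS
mean_age = toy_df.groupby("POS")["Age"].mean()

# Several statistics per group: the result is a dataframe, one column each
summary = toy_df.groupby("POS")["Age"].agg(["min", "max", "count"])

print(mean_age)
print(summary)
```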

In [None]:
# Remove columns that you HATE with .drop
# Need to save it as a new (or the same) variable
nba_df = nba_df.drop(["City"], axis=1)
nba_df.columns

In [None]:
# Calculate a new column from an existing column
nba_df["Ht (Cm.)"] = nba_df["Ht (In.)"] * 2.54
nba_df[:2]

In [None]:
# String manipulation on an entire column
# Need to use .str to treat it as a string
nba_df["Name"].str.lower()
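
The `.str` accessor has many more methods than `.lower()`. Here is a quick sketch on a made-up Series (the `names` variable and its values are hypothetical) showing that `.str.contains()` can also build a True/False mask for filtering. Passing `na=False` treats missing names as non-matches.

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(["LeBron James", "Kevin Durant", None])

# Lowercase every string in the column
lower = names.str.lower()

# Build a mask of names containing "James"; missing values count as False
has_james = names.str.contains("James", na=False)

print(lower.iloc[0])
print(has_james.tolist())
```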

In [None]:
# Do more intense manipulation with .apply + an external function
# You will always forget to do axis=1, so remember it!
# Just treat row like a dictionary, it goes one at a time
def do_i_like_them(row):
    if row["Age"] >= 31:
        return True
    else:
        return False

nba_df["Liked"] = nba_df.apply(do_i_like_them, axis=1)
nba_df["Liked"].value_counts()
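
For a simple rule like this one, `.apply` is not strictly needed, because comparing the whole column at once gives the same answers and is usually faster. Below is a sketch comparing the two approaches on made-up data (`toy_df` is hypothetical). `.apply` with `axis=1` is still a reasonable choice when the logic needs several columns or real branching.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [25, 31, 40],
})

def do_i_like_them(row):
    # row behaves like a dictionary keyed by column name
    return row["Age"] >= 31

# Row-by-row: calls the function once per row
via_apply = toy_df.apply(do_i_like_them, axis=1)

# Vectorized: compares the whole column in one step
via_vector = toy_df["Age"] >= 31

print(via_apply.tolist())
print(via_vector.tolist())
```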

In [None]:
# Get one row of a dataframe by position (here, the first row)
nba_df.iloc[0]
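
It is easy to mix up the ways of pulling pieces out of a dataframe: `.iloc` picks rows by position, `.loc` picks rows by index label, and plain square brackets with a column name pick a column. A minimal sketch with a hypothetical table (`toy_df` and its letter index are made up so that position and label differ):

```python
import pandas as pd

# Hypothetical toy data with letter labels as the index
toy_df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [25, 31]},
    index=["a", "b"],
)

# First row by position: a Series whose index is the column names
first_row = toy_df.iloc[0]

# The same row, looked up by its index label
same_row = toy_df.loc["a"]

# One column: a Series whose index is the row labels
age_col = toy_df["Age"]

print(first_row["Name"])
print(age_col.tolist())
```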

In [None]:
# For loops with dataframes
# "for row in nba_df" loops over the column names, so use iterrows() to get rows
for index, row in nba_df.iterrows():
    print(str(index) + ": " + row["Name"])
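
To see why a plain `for` loop over a dataframe is not what you want, here is a small sketch on made-up data (`toy_df` is hypothetical). It also shows `itertuples()`, another built-in way to loop over rows, where each row's columns are available as attributes.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [25, 31]})

# Looping over the dataframe itself yields the column names, not the rows
columns_seen = [col for col in toy_df]

# iterrows() yields (index, row) pairs; row acts like a dictionary
lines = [f"{index}: {row['Name']}" for index, row in toy_df.iterrows()]

# itertuples() yields one named tuple per row; the index is stored as .Index
tuple_lines = [f"{t.Index}: {t.Name}" for t in toy_df.itertuples()]

print(columns_seen)
print(lines)
print(tuple_lines)
```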

In [None]:
# Grouping by as many as you want
# Pass the grouping columns as a list (inside square brackets)
nba_df.groupby(["POS", "Race"])["Age"].describe()
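
Grouping by more than one column gives a result with a two-level index, which can be awkward to work with at first. This sketch, on a hypothetical table (`toy_df` and its rows are made up), shows looking up a single group with a tuple and flattening the result back into ordinary columns with `.reset_index()`.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the NBA table
toy_df = pd.DataFrame({
    "POS": ["F", "F", "G", "G"],
    "Race": ["Black", "White", "Black", "Black"],
    "Age": [22, 30, 25, 27],
})

# Mean age for every (POS, Race) combination that appears in the data
grouped = toy_df.groupby(["POS", "Race"])["Age"].mean()

# Look up one group by passing both keys as a tuple
guards_black = grouped.loc[("G", "Black")]

# Turn the two index levels back into regular columns
flat = grouped.reset_index()

print(guards_black)
print(flat)
```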

In [None]:
# Histograms
# Shows you the spread of one numerical value
nba_df["Age"].hist()

In [None]:
# Scatterplots show you the relationship of two numerical values
nba_df.plot(x="Ht (In.)", y="WT", kind="scatter")