# In this lecture we'll talk broadly about datasets.

Credit: https://www.basketball-reference.com/ for the datasets.

# Importing Data from a file

In [None]:
import pandas as pd
import numpy as np
#Take a look at bball1.csv, you can just open it in Jupyter
#csv is "comma-separated values", this is data separated by commas
#Pandas knows very well how to handle this data
pd.read_csv('data/bball1.csv')

pandas does not know which column to make the index. We'll have to tell it.

In [None]:
data = pd.read_csv('data/bball1.csv',index_col='Name')

In [None]:
data
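
As a minimal sketch of what `index_col` changes, the cell below reads the same CSV text twice. The CSV text, player names, and numbers are made up for illustration and are not from the bball files.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; names and numbers are made up.
csv_text = "Name,Age,PTS\nPlayer A,24,10.5\nPlayer B,30,8.0\n"

# Without index_col, pandas adds a default integer index (0, 1, ...).
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the chosen column becomes the row labels.
name_index = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(default_index.index.tolist())  # [0, 1]
print(name_index.index.tolist())     # ['Player A', 'Player B']
```

Note that `Name` is no longer a regular column in the second frame; it lives in the index.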

# Combining Data

Let's say we somehow stumbled upon new rows of data.

In [None]:
data_rows = pd.read_csv('data/bball3.csv',index_col='Name')

In [None]:
data_rows

In [None]:
pd.concat([data,data_rows])
#concat returns a new DataFrame; data itself is unchanged

In [None]:
#If we stumbled upon new columns
data_cols = pd.read_csv('data/bball2.csv',index_col='Name')

In [None]:
pd.concat([data,data_cols],axis=1)

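In general you will only use `concat` when this is easy, meaning the frames already line up: the same columns when adding rows, or the same index when adding columns.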
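
Here is a small sketch of the two "easy" cases, using made-up toy frames rather than the bball files (all names and values are hypothetical).

```python
import pandas as pd

# Hypothetical toy frames; values are made up for illustration.
base = pd.DataFrame({"Age": [24, 30]}, index=["Player A", "Player B"])
more_rows = pd.DataFrame({"Age": [27]}, index=["Player C"])
more_cols = pd.DataFrame({"PTS": [10.5, 8.0]}, index=["Player A", "Player B"])

# Same columns, new index labels: stack the rows (axis=0 is the default).
stacked = pd.concat([base, more_rows])

# Same index, new columns: put the frames side by side.
widened = pd.concat([base, more_cols], axis=1)

print(stacked.shape)  # (3, 1)
print(widened.shape)  # (2, 2)
```

In both cases `base` is left untouched, since `concat` builds a new frame.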

In [None]:
data_new_names = pd.read_csv('data/join_data.csv',index_col='Name')

In [None]:
data_new_names

In [None]:
data

`data_new_names` has entirely different columns from `data`. Its index has some of the same names as `data` and some new names.
We may want different things to happen when we put the two frames together.


`DF.join` allows for many options.

In [None]:
data.join(data_new_names)

## What is this doing?
* Keeps every row of `data`, finds the rows in `data_new_names` whose index is also in `data`, and drops the rows that aren't
* Tries to fill in the columns of both `data` and `data_new_names` for every row that is kept
* If there is no value, puts in a `NaN`
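
The behavior in the list above can be sketched on two tiny frames. The frames, names, and values are hypothetical; only the default `join` behavior is the point.

```python
import pandas as pd

# Hypothetical toy frames: "Player B" is in both, "Player C" only on the right.
left = pd.DataFrame({"Age": [24, 30]}, index=["Player A", "Player B"])
right = pd.DataFrame({"Team": ["Nets", "Pacers"]}, index=["Player B", "Player C"])

# The default is how='left': keep every row of `left`.
joined = left.join(right)
print(joined)

# "Player A" has no match in `right`, so its Team is NaN.
# "Player C" is not in `left`, so it does not appear at all.
```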

What if we wanted only the rows whose index is in both frames?

In [None]:
data.join(data_new_names,how='inner')

Rows with index in either?

In [None]:
data.join(data_new_names,how='outer')

In [None]:
#Only rows that are in data_new_names?
data.join(data_new_names,how='right')
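
To compare the four `how` options side by side, here is a sketch that collects which index labels survive each one. The toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames sharing exactly one label, "Player B".
left = pd.DataFrame({"Age": [24, 30]}, index=["Player A", "Player B"])
right = pd.DataFrame({"Team": ["Nets", "Pacers"]}, index=["Player B", "Player C"])

# Map each option to the (sorted) index labels in the joined frame.
labels_by_how = {
    how: sorted(left.join(right, how=how).index)
    for how in ["left", "right", "inner", "outer"]
}

for how, labels in labels_by_how.items():
    print(how, labels)
# left  ['Player A', 'Player B']
# right ['Player B', 'Player C']
# inner ['Player B']
# outer ['Player A', 'Player B', 'Player C']
```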

# The `on` argument in `join`
Sometimes we want to join two dataframes but not on their index, for example if name was not our index column. In that case we can use the `on` keyword, which names a column in the calling frame to match against the index of the other frame.
The values being matched against should be unique. pandas won't raise an error if they are duplicated, but every row gets paired with every match, which is rarely what you want.
For example, if on the Nets we had 3 players, one averaging 25 points, one averaging 26, and one averaging 27 points, and on the Pacers we had the same thing, we may want to join these Nets and Pacers players on points. This won't come up during this class.
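
A minimal sketch of `on`, assuming a frame where the team is an ordinary column and a second frame indexed by team. All names and numbers here are made up.

```python
import pandas as pd

# Hypothetical data: players with a default integer index and a Team column.
players = pd.DataFrame(
    {"Name": ["Player A", "Player B", "Player C"],
     "Team": ["Nets", "Pacers", "Nets"]}
)

# Hypothetical per-team data, indexed by team name (made-up win totals).
teams = pd.DataFrame({"Wins": [40, 35]}, index=["Nets", "Pacers"])

# Match the Team column of `players` against the index of `teams`.
with_wins = players.join(teams, on="Team")
print(with_wins)
```

Each player row picks up the `Wins` value of its team, and the integer index of `players` is kept.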

# You can get even fancier with `DF.merge`

`merge` can be thought of as a generalized `join`: it can match on the index, on columns, or on a mix of the two.
For my purposes, I'm always happy to use `join`. You can read more about `merge` here if you want.
https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
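
As a rough illustration of how the two relate, the sketch below does the same inner join with `join` and with `merge`, and then uses `merge` on a shared column instead of the index. The toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames indexed by player name.
left = pd.DataFrame({"Age": [24, 30]}, index=["Player A", "Player B"])
right = pd.DataFrame({"Team": ["Nets", "Pacers"]}, index=["Player B", "Player C"])

# Index-on-index inner join, written two ways.
via_join = left.join(right, how="inner")
via_merge = left.merge(right, left_index=True, right_index=True, how="inner")
print(via_join)
print(via_merge)

# merge can also match on a shared column instead of the index.
left_cols = left.rename_axis("Name").reset_index()
right_cols = right.rename_axis("Name").reset_index()
on_column = left_cols.merge(right_cols, on="Name", how="inner")
print(on_column)
```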

# More on the index

What if we decided we didn't want to index by name?

In [None]:
data

In [None]:
data.reset_index()
#turns Name back into a column and puts integers in the index

In [None]:
data_reset = data.reset_index()


In [None]:
#set 3P% as the index
data.set_index('3P%')
#returns a new DataFrame; data itself is unchanged
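
To make the "returns a new frame" point concrete, here is a sketch on a hypothetical toy frame. It only checks that the original is untouched after `reset_index` and `set_index`.

```python
import pandas as pd

# Hypothetical toy frame indexed by Name; values are made up.
frame = pd.DataFrame(
    {"Age": [24, 30], "PTS": [10.5, 8.0]},
    index=pd.Index(["Player A", "Player B"], name="Name"),
)

flat = frame.reset_index()      # Name becomes a column, index is 0, 1
by_age = frame.set_index("Age")  # Age moves out of the columns into the index

print(flat.columns.tolist())    # ['Name', 'Age', 'PTS']
print(by_age.columns.tolist())  # ['PTS']
print(frame.columns.tolist())   # ['Age', 'PTS']  (original unchanged)
```

To keep a result, assign it to a name, as the next cells do with `data_age`.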

In [None]:
#index entries should be unique, but you won't get an error if they aren't
data.set_index('Age')

In [None]:
data_age = data.set_index('Age')

In [None]:
#bizarre: with a non-unique index, .loc[24] returns every row whose Age is 24
data_age.loc[24]
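
A small sketch of why a non-unique index is awkward, on a hypothetical frame where two players share an age. The result type of `.loc` depends on how many rows match.

```python
import pandas as pd

# Hypothetical toy frame: two players are 24, one is 30.
ages = pd.DataFrame(
    {"Name": ["Player A", "Player B", "Player C"], "Age": [24, 24, 30]}
).set_index("Age")

print(ages.index.is_unique)  # False

both_24 = ages.loc[24]  # two matches: a DataFrame with two rows
only_30 = ages.loc[30]  # one match: a Series for that single row

print(type(both_24).__name__, type(only_30).__name__)
```

Checking `index.is_unique` before relying on `.loc` is one way to avoid this surprise.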

In [None]:
#Changing the order, adding and removing values from the index
data

In [None]:
inds_shuffled = pd.Series(data.index).sample(frac=1).values

In [None]:
inds_shuffled

In [None]:
data.reindex(inds_shuffled)
#changes order

In [None]:
inds_shuffled[:5]

In [None]:
data.reindex(inds_shuffled[:5])
#keeps only the five listed rows; rows not listed are dropped

In [None]:
np.concatenate((inds_shuffled,['Lebron James']))

In [None]:
data.reindex(np.concatenate((inds_shuffled,['Lebron James'])))
#adds a row for the new label, but it is all NaNs
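
The three uses of `reindex` above can be sketched in one place on a hypothetical toy frame (names and ages are made up).

```python
import pandas as pd

# Hypothetical toy frame indexed by player name.
frame = pd.DataFrame(
    {"Age": [24, 30, 27]}, index=["Player A", "Player B", "Player C"]
)

# Same labels in a new order: rows are reordered.
reordered = frame.reindex(["Player C", "Player A", "Player B"])

# Fewer labels: rows that are not listed are dropped.
subset = frame.reindex(["Player C", "Player A"])

# A label that is not in the frame: a new row filled with NaN.
extended = frame.reindex(["Player A", "Player D"])

print(reordered)
print(subset)
print(extended)
```

Like `set_index`, `reindex` returns a new frame and leaves `frame` as it was.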

# Saving to a file

In [None]:
#replace <FILE_NAME> with the path to write to, given as a string
data.to_csv(<FILE_NAME>)
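
As a sketch of a full save-and-reload round trip, the cell below writes a hypothetical toy frame to a temporary file and reads it back. The file name `example.csv` and the data are assumptions for illustration only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy frame indexed by Name; values are made up.
frame = pd.DataFrame(
    {"Age": [24, 30]},
    index=pd.Index(["Player A", "Player B"], name="Name"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name
    frame.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col="Name")

print(restored)
```

Because `to_csv` writes the index as an ordinary first column, reading the file back needs `index_col` again to recover the same shape.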