## Tutorial on Pandas

Created by **John C.S. Lui** for CSCI3320 (Fundamentals of Machine Learning).

**Date:**  Jan 23, 2021.

We will go over:

1. Install Pandas
2. Load a file (e.g., a comma-separated values (CSV) file)
3. Find out various display options
4. Examine the data as well as the schema of the data file
5. Relationship with Python's dictionary

## Pandas

Pandas is an open-source **data analysis** and **manipulation tool**, built on top of the Python programming language. It helps us to *open, modify* and *visualize* our data.  

To install pandas, do <br>
% `pip install pandas`

To proceed with this tutorial, you need to first download some data files from <br>
https://insights.stackoverflow.com/survey

I use the 2019 dataset.  You should download the file `developer_survey_2019.zip`, unzip it, and put all the files under the sub-directory `data`.


In [None]:
import pandas as pd   # import pandas and use pd as alias

In [None]:
# read the csv file (we can also open the raw csv file in a text editor to take a look at it)
# df is a variable that holds the loaded data in memory; df stands for dataframe

df = pd.read_csv('data/survey_results_public.csv')
df   # display the dataframe

#### Comments: 
The output shows that our data file has 88883 rows and 85 columns (attributes/features).  When we look at the display, we see
that it **truncates** some columns and rows to keep the output compact.
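
To try `read_csv` without downloading the survey first, here is a minimal sketch that reads a tiny CSV from an in-memory string. The three rows and the name `tiny_df` are made up for illustration and are not part of the survey data.

```python
import io
import pandas as pd

# Hypothetical miniature CSV, standing in for the real survey file
csv_text = """Respondent,Hobbyist,Country
1,Yes,United Kingdom
2,No,Canada
3,Yes,Thailand
"""

# read_csv accepts a file path or any file-like object
tiny_df = pd.read_csv(io.StringIO(csv_text))

print(tiny_df.shape)          # (number of rows, number of columns)
print(list(tiny_df.columns))  # column names come from the header line
```

The first line of the CSV is treated as the header, so it becomes the column names rather than a data row.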

In [None]:
# let's look at the last 10 rows of that data file
df.tail(10)
#df.head(10)

In [None]:
# We can find the dimensions of our dataframe using `shape`, which gives (number of rows, number of columns)
df.shape

In [None]:
# Let's get information about our dataframe: it lists each feature's name, non-null count, data type, etc.

df.info()

In [None]:
# Let's change the pandas display setting so that we can see all 85 features
pd.set_option('display.max_columns', 85)
df

In [None]:
# We can examine the schema of the data file by reading the accompanying schema file
schema_df = pd.read_csv('data/survey_results_schema.csv')
schema_df

In [None]:
pd.set_option('display.max_rows', 6)   # show at most 6 rows when displaying a dataframe
df
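
`pd.set_option` changes a display setting for the rest of the session. If we only want a setting to apply temporarily, one possible approach is `pd.option_context`, which restores the previous values when the `with` block ends. The sketch below only reads the option values back to show this behaviour.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards
before = pd.get_option("display.max_columns")

# Inside the block, the two options take the values given here
with pd.option_context("display.max_columns", 85, "display.max_rows", 6):
    inside_cols = pd.get_option("display.max_columns")
    inside_rows = pd.get_option("display.max_rows")

# Outside the block, the earlier value is back in effect
after = pd.get_option("display.max_columns")

print(before, inside_cols, inside_rows, after)
```

To undo a `pd.set_option` call explicitly, `pd.reset_option('display.max_rows')` returns that option to its default.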

In [None]:
schema_df

In [None]:
df.head()  # display the first 5 rows of the dataframe

In [None]:
df.head(10) # display the first 10 rows

In [None]:
# display the columns (or features) of the dataframe
df.columns

In [None]:
# How should we think about the dataframe, in particular each feature (column)?
# In Python terms, the dataframe behaves like a dictionary: each key is a feature name, and each value
# holds the data (rows) of the corresponding feature.

# Let's look at the feature 'Hobbyist'
df['Hobbyist']

In [None]:
# We can also examine selected rows of one or more features
# (note: a `loc` slice includes its end label, so 0:5 returns 6 rows)
df.loc[0:5, 'Hobbyist']

In [None]:
df.loc[0:5, 'Hobbyist':'Employment']
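
One detail of `loc` that is easy to miss: a label slice includes both end points, whereas `iloc` follows the usual Python rule and excludes the stop position. The sketch below uses a small made-up dataframe (`demo_df` and its values are hypothetical, with column names borrowed from the survey) to show the difference.

```python
import pandas as pd

# Hypothetical data with 7 rows, labelled 0..6 by default
demo_df = pd.DataFrame({
    "Hobbyist":   ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
    "OpenSource": ["a", "b", "c", "d", "e", "f", "g"],
    "Employment": ["full", "part", "full", "none", "full", "part", "full"],
})

by_label = demo_df.loc[0:5, "Hobbyist"]   # labels 0 through 5, inclusive
by_position = demo_df.iloc[0:5, 0]        # positions 0 through 4

# A column slice by label also includes both end points
col_slice = demo_df.loc[0:5, "Hobbyist":"Employment"]

print(len(by_label), len(by_position))
print(list(col_slice.columns))
```

Here `by_label` has six rows and `by_position` has five, even though both slices are written as `0:5`.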

In [None]:
# This can be illustrated by creating a dictionary and then using pandas to view it

# define dictionary
people = {
        "first": ['John', 'Jack', 'Jill'],
        "last":  ['Lui', 'Lee', 'Chan'], 
        "email": ["cslui@cse.cuhk.edu.hk", "jacklee@cse.cuhk.edu.hk", "Jill_Lee@cse.cuhk.edu.hk"] }

people["email"]

In [None]:
# now use Pandas to view all attributes
import pandas as pd

simple_df = pd.DataFrame(people)
simple_df

In [None]:
simple_df['email']

In [None]:
simple_df.columns  # shows the features of the dataframe
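
To make the relationship with Python's dictionary concrete, the following sketch checks that the dictionary keys become the column names, that selecting a column returns the same values as the dictionary lookup, and that the dataframe can be converted back into a dictionary of lists. It rebuilds `people` so the block runs on its own; `round_trip` is just an illustrative name.

```python
import pandas as pd

people = {
    "first": ["John", "Jack", "Jill"],
    "last":  ["Lui", "Lee", "Chan"],
    "email": ["cslui@cse.cuhk.edu.hk", "jacklee@cse.cuhk.edu.hk", "Jill_Lee@cse.cuhk.edu.hk"],
}
simple_df = pd.DataFrame(people)

# Dictionary keys become column names, in the same order
print(list(simple_df.columns) == list(people.keys()))

# Selecting a column gives a Series holding the same values as the dictionary entry
print(simple_df["email"].tolist() == people["email"])

# The dataframe can be turned back into a dictionary of lists
round_trip = simple_df.to_dict(orient="list")
print(round_trip == people)
```

One difference to keep in mind: `people["email"]` is a plain list, while `simple_df["email"]` is a pandas `Series` that also carries the row index.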

In [None]:
# Show the data from a particular row
simple_df.iloc[0]

In [None]:
simple_df.iloc[0:2]

In [None]:
# access particular rows
simple_df.iloc[[0,2]]     # to access first and last row

In [None]:
# access particular rows in a chosen order
simple_df.iloc[[2,0,1]]     # to access the last, first and second rows, in that order

In [None]:
simple_df.iloc[0,2]   # access a particular row (data) and column (feature)

In [None]:
# Or we can display certain rows with a certain feature
simple_df.iloc[[2,0],2]

In [None]:
# Or we can display certain rows with certain features
simple_df.iloc[[2,0],[2,1]]

In [None]:
simple_df.loc[[2,0],'email']    # Note that we are using 'loc' now, not 'iloc'

In [None]:
simple_df.loc[[2,0],['email', 'first']]
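
With the default index 0, 1, 2, `loc` and `iloc` accept the same row numbers, which can hide the difference between them: `iloc` selects by position, `loc` selects by label. The sketch below gives the rows the hypothetical labels 10, 20, 30 so that the two no longer coincide.

```python
import pandas as pd

people = {
    "first": ["John", "Jack", "Jill"],
    "last":  ["Lui", "Lee", "Chan"],
}

# Hypothetical row labels 10, 20, 30 instead of the default 0, 1, 2
labelled_df = pd.DataFrame(people, index=[10, 20, 30])

by_position = labelled_df.iloc[0, 0]      # first row, first column
by_label = labelled_df.loc[10, "first"]   # row labelled 10, column 'first'

# There is no row labelled 0, so a label lookup for 0 fails
try:
    labelled_df.loc[0, "first"]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_position, by_label, label_zero_found)
```

Both lookups return the same cell, but only because position 0 happens to carry the label 10.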

In [None]:
df.columns

In [None]:
# show rows 1, 2 and 4 with particular features only
df.loc[[1,2,4], ['Hobbyist', "OpenSource", 'EdLevel']]

In [None]:
# We can also do some counting
df['Hobbyist'].value_counts()
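
`value_counts` also takes a few optional arguments. As a small sketch on made-up answers (the `answers` Series below is hypothetical, not survey data), `normalize=True` returns proportions instead of counts, and `dropna=False` keeps missing answers in the tally.

```python
import pandas as pd

# Hypothetical answers, including one missing value
answers = pd.Series(["Yes", "No", "Yes", "Yes", None], name="Hobbyist")

counts = answers.value_counts()                     # missing values are dropped by default
shares = answers.value_counts(normalize=True)       # proportions of the non-missing answers
with_missing = answers.value_counts(dropna=False)   # also counts the missing entry

print(counts)
print(shares)
print(with_missing)
```

By default the result is sorted from the most frequent value to the least frequent.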