# Introduction

Pandas is a widely used Python library for data analysis. To use it, we first need to import it.

In [0]:
import pandas as pd

# Loading the Data

Pandas can read datasets in many file formats. The most common one is the CSV (comma-separated values) file.

This tutorial uses a sports dataset that contains the results of NCAA basketball games from 1985 to 2016. We use **pd.read_csv()** to read the data; the function returns a **DataFrame**.

In [0]:
df = pd.read_csv('https://github.com/jsphchan/FundAI/blob/master/Lab3/RegularSeasonCompactResults.csv?raw=true')
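
As a small offline illustration of what **pd.read_csv()** returns, the sketch below parses a few rows from an in-memory string instead of a URL. The rows are made up for this example; only the column names are borrowed from the tutorial.

```python
import io
import pandas as pd

# Hypothetical sample rows (not taken from the real dataset).
csv_text = """Season,Wteam,Wscore,Lscore
1985,1228,81,64
1985,1106,77,70
1986,1112,63,56
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the result is a pandas DataFrame
print(sample.shape)   # (number of rows, number of columns)
```

The first line of the text is used as the header row, so the column names come from the file itself.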

# Understanding the Data

When we first come across a dataset, it is important to take a look and see what is inside.
To start with, let's use the **shape** attribute to find the dimensions of the dataframe, given as (number of rows, number of columns).

In [0]:
df.shape

Let's use **head()** to see the first few rows of the dataframe (or **tail()** to see the last few rows).

In [0]:
df.head()

In [0]:
df.tail()

The **describe()** function is a convenient tool for obtaining summary statistics such as the mean, standard deviation, and percentiles of each numeric column of the dataframe.

In [0]:
df.describe()
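
The result of **describe()** is itself a dataframe, so individual statistics can be looked up by row and column label. The sketch below uses a tiny made-up frame; the `Wloc` text column is a hypothetical addition, included only to show that non-numeric columns are left out by default.

```python
import pandas as pd

# Hypothetical data for illustration only.
games = pd.DataFrame({
    "Wscore": [80, 90, 100],
    "Lscore": [70, 75, 65],
    "Wloc": ["H", "A", "N"],   # text column, skipped by describe() by default
})

summary = games.describe()

# Rows of the summary are statistics, columns are the numeric columns.
mean_wscore = summary.loc["mean", "Wscore"]
print(summary)
print(mean_wscore)
```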

If we want to find the maximum value of a particular column (for example, Wscore), we can select the column using the bracket operator and then call the **max()** function.

In [0]:
df['Wscore'].max()

Similarly, we can find the average value of a column using the **mean()** function.

In [0]:
df['Lscore'].mean()

It is also possible to find the row where the maximum score is located. The **idxmax()** function returns the index label of that row, which matches the row number here because the dataframe uses the default integer index.

In [0]:
df['Wscore'].idxmax()
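
The label returned by **idxmax()** is most useful when passed to **loc**, which retrieves the whole row. A minimal sketch on made-up data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.DataFrame({
    "Wteam": [1228, 1106, 1112],
    "Wscore": [81, 97, 63],
    "Lscore": [64, 70, 56],
})

top_label = scores["Wscore"].idxmax()   # index label of the highest Wscore
top_row = scores.loc[top_label]         # full row stored under that label

print(top_label)
print(top_row)
```

On the tutorial's dataframe, the same pattern would be `df.loc[df['Wscore'].idxmax()]`.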

Another useful function is **value_counts()**, which shows how many times each value appears in the column.

In [0]:
df['Season'].value_counts()
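
By default, **value_counts()** orders its result by count, most frequent first. If an ordering by the values themselves is preferred (for example, by season), the result can be sorted by its index. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical season values for illustration only.
seasons = pd.Series([1985, 1985, 1986, 1985, 1986, 1987], name="Season")

counts = seasons.value_counts()    # ordered by count, largest first
by_season = counts.sort_index()    # reordered by season

print(counts)
print(by_season)
```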

# Row and Column Filtering

Because the dataset contains a lot of information, we often need to select only the columns that are relevant to the task at hand. We can do this by passing a list of column names.

In [0]:
df[['Wteam', 'Wscore']]

We can also filter rows by a condition. Multiple conditions can be combined in a single filtering statement, as long as each condition is wrapped in parentheses.

In [0]:
df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]
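
Each condition produces a boolean Series, and these can be combined with `&` (and), `|` (or), and `~` (not). The parentheses matter because these operators bind more tightly than comparisons such as `>`. The sketch below names the conditions first to make this visible; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
games = pd.DataFrame({
    "Wscore": [160, 155, 90, 152],
    "Lscore": [95, 120, 60, 99],
})

high_w = games["Wscore"] > 150   # boolean Series, one value per row
low_l = games["Lscore"] < 100

both = games[high_w & low_l]     # rows where both conditions hold
either = games[high_w | low_l]   # rows where at least one holds
not_high = games[~high_w]        # rows where the first condition fails

print(both)
print(either)
print(not_high)
```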

Note that selecting a single column with single square brackets returns a Pandas **Series** instead of a DataFrame, while double square brackets return a DataFrame. In Pandas, a Series is a 1-dimensional data structure whose values all share one data type, while a DataFrame is a 2-dimensional data structure whose columns may have different data types.

In [0]:
print(type(df['Wscore']))

In [0]:
print(type(df[['Wscore']]))

# Visualizing Data

A convenient way to visualize a DataFrame is through matplotlib. Here we plot a histogram of the winning team's scores.

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline
ax = df['Wscore'].plot.hist(bins=40)
ax.set_xlabel('Points for Winning Team')