# Lesson 01 - Pandas basics

In [3]:
import pandas as pd
import numpy as np

pd.__version__

'0.22.0'

## Load data and basic info

Let's load a CSV file and view it. We use the pandas read_csv function to do that:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

The function loads a CSV file into a DataFrame object.

There are plenty of options to try out. Below, we parse dates in two of the columns. Pandas uses the concept of an index. You can use any column as the index. If you do not provide one, a sequential index is added.

In [4]:
bugs = pd.read_csv('./data/bugs_train.csv', parse_dates=['Opened', 'Changed'], index_col=None)
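
To make the effect of parse_dates and index_col easier to see, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The CSV content and the Id column are made up for illustration and are not taken from bugs_train.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for ./data/bugs_train.csv
csv_text = """Id,Opened,Changed,Priority
101,2004-03-01 10:00:00,2004-03-05 12:30:00,P3
102,2005-07-15 08:15:00,2005-07-16 09:00:00,P1
"""

# index_col=None: pandas adds a sequential index 0, 1, ...
default_idx = pd.read_csv(io.StringIO(csv_text),
                          parse_dates=['Opened', 'Changed'], index_col=None)

# index_col='Id': the (assumed) Id column becomes the index
id_idx = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Opened', 'Changed'], index_col='Id')

print(default_idx.index.tolist())
print(id_idx.index.tolist())
# parse_dates turns the two columns into datetime columns
print(pd.api.types.is_datetime64_any_dtype(default_idx['Opened']))
```

With a column used as the index, its values become the row labels that loc expects later in this lesson.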

If you want to view your data frame, you can use the head and tail methods:

In [None]:
bugs.head(3)

In [None]:
bugs.head(5)

In [None]:
bugs.tail(3)

Sometimes we want to see the data types of the columns in our data frame:

In [None]:
bugs.dtypes

We use the shape property to see the dimensions of the data frame (rows, columns):

In [None]:
bugs.shape

You can check the names of the columns by using the following code:

In [None]:
# .values also works on newer pandas versions, where get_values() was removed
bugs.columns.values
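
As a small sketch of how shape and columns fit together, the snippet below uses a toy data frame (the values are invented; only the column names echo the bugs data) and unpacks the shape tuple into row and column counts.

```python
import pandas as pd

# Toy data frame; the values are hypothetical
df = pd.DataFrame({
    "Component": ["Debug", "UI"],
    "Priority": ["P1", "P3"],
    "Votes": [2, 0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape

# columns is an Index object; list() gives plain Python strings
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
```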

## Basic data manipulation

Selecting columns

In [None]:
bugs[['Component', 'Assignee']].head(2)

Access individual cells with the loc indexer. You have to provide labels as coordinates to access a cell. With the default index, the row labels are the sequential integers; for columns, the labels are the column names:

In [None]:
# accessing by label - 3 is the index label in this case
bugs.loc[3, "Assignee"]

You can also provide a slice or a list of labels to select a subregion of the data frame:

In [None]:
bugs.loc[1:3, ["Assignee", "Status"]]
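
One detail worth checking here is that label slices in loc include the end label, while position slices in iloc exclude the end position. A minimal sketch on a toy data frame (names and values are invented):

```python
import pandas as pd

# Toy data frame with the default sequential index 0..4
df = pd.DataFrame({
    "Assignee": ["ann", "bob", "cat", "dan", "eve"],
    "Status": ["NEW", "RESOLVED", "CLOSED", "NEW", "VERIFIED"],
})

# Label slice: rows labelled 1, 2 and 3 (end label included)
by_label = df.loc[1:3, ["Assignee", "Status"]]

# Position slice: rows at positions 1 and 2 (end position excluded)
by_position = df.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```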

The ix indexer gives you flexibility, since it can access cells by label or by position; however, it is now deprecated:

In [None]:
bugs.ix[3, "Assignee"]

In [None]:
bugs.ix[3, 1]

However, you can achieve the same behaviour using the standard loc and iloc methods

In [None]:
# using labels when you know the column number
bugs.loc[3, bugs.columns[1]]

We use the iloc indexer below. It works on integer positions instead of labels, so we first translate the row label into a position with get_loc:

In [None]:
bugs.iloc[bugs.index.get_loc(3), 1]
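
With the default index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses a toy data frame with a deliberately shuffled index (all values are invented) to make the difference visible.

```python
import pandas as pd

# Toy data frame whose index labels differ from the row positions
df = pd.DataFrame({"Assignee": ["ann", "bob", "cat"]}, index=[30, 10, 20])

by_label = df.loc[10, "Assignee"]       # row labelled 10
by_position = df.iloc[0, 0]             # first row, first column

# get_loc translates a label into its integer position
pos = df.index.get_loc(10)
via_get_loc = df.iloc[pos, 0]

print(by_label, by_position, pos, via_get_loc)
```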

Remove rows/columns

In [None]:
# remove rows
bugs.drop([1,2], axis=0).head(4) 

In [None]:
# remove columns
bugs.drop(["Status", "Assignee"], axis=1).head(4)
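
Note that drop returns a new data frame and leaves the original untouched unless you assign the result back. A minimal sketch on a toy data frame (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Summary": ["crash on start", "typo in menu", "slow build"],
    "Status": ["NEW", "CLOSED", "RESOLVED"],
})

# drop returns a copy without the column; df itself is unchanged
without_status = df.drop(["Status"], axis=1)

# the columns= keyword is an equivalent way to say axis=1
same_result = df.drop(columns=["Status"])

# to keep the change, assign the result back to a name
fewer_rows = df.drop([0, 1], axis=0)

print(df.columns.tolist())
print(without_status.columns.tolist())
print(fewer_rows.index.tolist())
```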

Rename

In [None]:
bugs.rename(columns={"Summary": "Info"}).head(2)

In [None]:
bugs.rename(columns=str.upper).head(2)

We can also define our own function. Let it add my_ as a prefix to the name of each column:

In [None]:
def my_column(x):
    return "my_"+x
bugs.rename(columns=my_column).head(2)

Changing type to category

In [None]:
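# passing categories/ordered directly to astype is deprecated; use a CategoricalDtype
status_type = pd.api.types.CategoricalDtype(categories=["VERIFIED", "RESOLVED", "CLOSED"], ordered=True)
bugs.Status.astype(status_type).head()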
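
An ordered category is useful because its values can be compared and sorted in the order you declared rather than alphabetically. The sketch below shows this on a toy Series; the status values are invented, and the ordering VERIFIED < RESOLVED < CLOSED simply follows the category list used in this lesson.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy status column
status = pd.Series(["RESOLVED", "VERIFIED", "CLOSED", "RESOLVED"])

# Declared order: VERIFIED < RESOLVED < CLOSED
status_type = CategoricalDtype(categories=["VERIFIED", "RESOLVED", "CLOSED"],
                               ordered=True)
ordered = status.astype(status_type)

# Comparisons follow the declared order, not the alphabet
after_verified = ordered[ordered > "VERIFIED"]

# Sorting follows the declared order as well
sorted_values = ordered.sort_values().tolist()

print(after_verified.index.tolist())
print(sorted_values)
```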

Filtering

In [None]:
bugs[bugs['Component'] == 'Debug'].head(3)

In [None]:
bugs[(bugs['Component'] == 'Debug') & (bugs['Severity'] == 'normal')].head(3)

In [None]:
bugs[(bugs['Component'] == 'Debug') | (bugs['Severity'] == 'normal')].head(3)
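
The filters above work by building a boolean Series (a mask) and passing it to the data frame. The sketch below names the masks explicitly on a toy data frame (values are invented), which can make longer conditions easier to read. The parentheses in the one-line form are needed because & and | bind more tightly than ==.

```python
import pandas as pd

df = pd.DataFrame({
    "Component": ["Debug", "UI", "Debug", "Core"],
    "Severity": ["normal", "normal", "major", "minor"],
})

# Each comparison yields a boolean Series with one value per row
is_debug = df["Component"] == "Debug"
is_normal = df["Severity"] == "normal"

both = df[is_debug & is_normal]          # logical AND
either = df[is_debug | is_normal]        # logical OR
neither = df[~(is_debug | is_normal)]    # logical NOT

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```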

In [None]:
bugs[bugs['Opened'] > '2005'].head(2)
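
When a datetime column is compared with a string such as '2005', pandas parses the string into a timestamp first. The sketch below, on a toy data frame with invented dates, spells the cutoff out as an explicit pd.Timestamp and checks that both forms select the same rows. Note that a row opened exactly at midnight on 1 January 2005 is not "greater than" the cutoff.

```python
import pandas as pd

df = pd.DataFrame({
    "Opened": pd.to_datetime([
        "2004-12-31 23:59:00",
        "2005-01-01 00:00:00",
        "2005-06-15 09:30:00",
    ])
})

# Explicit cutoff: the start of 2005
cutoff = pd.Timestamp("2005-01-01")
after_cutoff = df[df["Opened"] > cutoff]

# String form used in the lesson; pandas parses it to a timestamp
after_string = df[df["Opened"] > "2005"]

print(after_cutoff.index.tolist())
print(after_string.equals(after_cutoff))
```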

In [1]:
bugs[bugs['Priority'].isin(["P1", "P2"])].head(2)

In [8]:
list(bugs[bugs['Opened'] > '2005'].index)

[23571, 23572, 23573, ..., 23694, 23695, ...]

Adding a column (new values, based on existing columns)

In [None]:
# one random value per row; assign returns a new data frame
bugs.assign(x=np.random.randn(bugs.shape[0]))
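
For a column that is actually derived from existing columns, here is a minimal sketch that computes the number of days between Opened and Changed. The days_open name and the dates are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Opened": pd.to_datetime(["2005-01-01", "2005-02-10"]),
    "Changed": pd.to_datetime(["2005-01-04", "2005-02-11"]),
})

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days
with_age = df.assign(days_open=(df["Changed"] - df["Opened"]).dt.days)

# The same with a lambda, which receives the data frame being assigned to
with_age_lambda = df.assign(days_open=lambda d: (d["Changed"] - d["Opened"]).dt.days)

print(with_age["days_open"].tolist())
print(df.columns.tolist())  # the original data frame is unchanged
```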

Merge by row

In [None]:
pd.concat([bugs, bugs], axis=0).head(1)

Merge by column

In [None]:
pd.concat([bugs, bugs], axis=1).head(1)
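
A sketch of what happens to the index when concatenating, using two toy data frames (names and values are invented). Stacking rows keeps the original row labels, so labels can repeat unless ignore_index=True is passed; concatenating by column aligns rows on the index.

```python
import pandas as pd

first = pd.DataFrame({"Id": [1, 2]})
second = pd.DataFrame({"Id": [3, 4]})

# axis=0 stacks rows and keeps the original labels: 0, 1, 0, 1
stacked = pd.concat([first, second], axis=0)

# ignore_index=True renumbers the rows: 0, 1, 2, 3
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

# axis=1 puts the frames side by side, matching rows by index label
side_by_side = pd.concat([first, second.rename(columns={"Id": "OtherId"})], axis=1)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side.shape)
```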

## Tasks

Task 1. Display the last ten rows of the bugs data frame

Task 2. Create a new data frame by selecting the Opened, Changed, and Priority columns from bugs

Task 3. Select row 20 and the columns Opened, Changed, and Priority of bugs

Task 4. Remove the column Summary from bugs

Task 5. Select rows for which Assignee is eclipse