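<p> Pandas is a Python package widely used in data science. Its name derives from "panel data" and is also a play on "Python data analysis". A plain NumPy array can represent a spreadsheet of data, but it is not the best format: all of its elements share one data type, and its rows and columns carry no labels. Pandas also provides a convenient way to read data from a spreadsheet in an external file. </p>
<p> To read a file, such as a CSV or Excel file, pass its path to pandas. A bare file name like the one below is resolved relative to the current working directory, so the simplest setup is to keep the file in the same folder as this Jupyter notebook. For a CSV file, use the read_csv function (Excel files have a counterpart, read_excel): </p>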

In [14]:
import pandas as pd
df = pd.read_csv("iris.csv")
print(df)

     SepalLength  SepalWidth  PetalLength  PetalWidth         Species
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.6         3.1          1.5         0.2     Iris-setosa
3            5.0         3.6          1.4         0.2     Iris-setosa
4            5.4         3.9          1.7         0.4     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.5         3.0          5.5         1.8  Iris-virginica
146          7.7         2.6          6.9         2.3  Iris-virginica
147          6.0         2.2          5.0         1.5  Iris-virginica
148          6.9         3.2          5.7         2.3  Iris-virginica
149          6.2         2.8          4.8         1.8  Iris-virginica

[150 rows x 5 columns]
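
As a side note, read_csv is not limited to file names: it also accepts any file-like object. The sketch below uses that to stay self-contained; the two-row CSV text and the name small are made up for illustration and are not part of iris.csv.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file (hypothetical data, not iris.csv).
csv_text = "SepalLength,Species\n5.1,Iris-setosa\n6.5,Iris-virginica\n"

# read_csv accepts a path or a file-like object such as StringIO.
small = pd.read_csv(io.StringIO(csv_text))

print(small.shape)           # (number of rows, number of columns)
print(list(small.columns))   # column names taken from the first line
```

The first line of the text is treated as the header row, which is why the column names come out as SepalLength and Species.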


The object we got here, named df, is a DataFrame, the pandas object most commonly used in data science. When printed, it looks like a spreadsheet. To take a quick look without filling the entire screen when the spreadsheet is large, use the .head() method, which returns the first 5 rows of the dataframe by default:

In [15]:
df.head()

   SepalLength  SepalWidth  PetalLength  PetalWidth      Species
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa
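
The number of rows is adjustable, and there is a matching method for the end of the table. A minimal sketch on a made-up ten-row frame (demo is a hypothetical stand-in, not the iris data):

```python
import pandas as pd

# Toy frame with ten rows so that head and tail differ visibly.
demo = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

first_three = demo.head(3)   # first 3 rows instead of the default 5
last_two = demo.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```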


Another useful method that gives an overall view of the data is info, which reports each column's name, non-null count, and data type.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Species        150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
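
If you want numbers rather than a structural overview, shape, dtypes, and describe are common companions to info. The sketch below runs them on a small made-up frame (demo and its values are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "length": [5.1, 4.9, 6.5],
    "species": ["setosa", "setosa", "virginica"],
})

print(demo.shape)    # (rows, columns) as a tuple
print(demo.dtypes)   # data type of each column

# describe() summarizes numeric columns only by default,
# so the text column "species" is left out here.
summary = demo.describe()
print(summary)
```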


To select one particular column of the dataframe, use df["column name"], which returns a pandas Series:

In [17]:
a = df["SepalLength"]
print(a)

0      5.1
1      4.9
2      4.6
3      5.0
4      5.4
      ... 
145    6.5
146    7.7
147    6.0
148    6.9
149    6.2
Name: SepalLength, Length: 150, dtype: float64


Another way is df.columnname, but it only works if the column name is a legal Python identifier and does not clash with an existing DataFrame attribute or method.

In [18]:
df.SepalLength

0      5.1
1      4.9
2      4.6
3      5.0
4      5.4
      ... 
145    6.5
146    7.7
147    6.0
148    6.9
149    6.2
Name: SepalLength, Length: 150, dtype: float64
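
To see why the bracket form is the safer default, consider a made-up frame whose column names are awkward on purpose (both names below are hypothetical): one contains a space, the other collides with the DataFrame.count method.

```python
import pandas as pd

demo = pd.DataFrame({"sepal length": [5.1, 4.9], "count": [1, 2]})

# Bracket access works for any column name.
print(demo["sepal length"])
print(demo["count"])

# Attribute access finds the DataFrame.count method first,
# so demo.count is a method here, not the column.
print(callable(demo.count))
```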

Once you have selected a column, you can treat it as a normal Python iterable and do calculations on it. For example, find its length and sum:

In [27]:
a = df["SepalLength"]
print(len(a))
print(sum(a))

150
876.5000000000001
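
A column also carries its own methods for common statistics, which are usually more convenient than the built-in functions. A small sketch on five made-up values (s is a hypothetical Series, not the real column):

```python
import pandas as pd

s = pd.Series([5.1, 4.9, 4.6, 5.0, 5.4])

total = s.sum()      # sum of all values
average = s.mean()   # arithmetic mean
smallest = s.min()
largest = s.max()

print(total, average, smallest, largest)
```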


To select a single row by its integer position, use iloc:

In [19]:
df.iloc[100]

SepalLength               6.5
SepalWidth                3.2
PetalLength               5.1
PetalWidth                  2
Species        Iris-virginica
Name: 100, dtype: object

To select a range of rows, slice with .iloc. As with slicing a Python list, the end position is excluded:

In [20]:
df.iloc[35:55]

    SepalLength  SepalWidth  PetalLength  PetalWidth      Species
35          4.4         3.0          1.3         0.2  Iris-setosa
36          5.1         3.4          1.5         0.2  Iris-setosa
37          5.0         3.5          1.3         0.3  Iris-setosa
38          4.5         2.3          1.3         0.3  Iris-setosa
39          5.1         3.8          1.9         0.4  Iris-setosa
40          4.8         3.0          1.4         0.3  Iris-setosa
41          5.1         3.8          1.6         0.2  Iris-setosa
42          4.6         3.2          1.4         0.2  Iris-setosa
43          5.3         3.7          1.5         0.2  Iris-setosa
44          5.0         3.3          1.4         0.2  Iris-setosa


To select multiple rows and multiple columns at once, give iloc a row slice and a column slice separated by a comma:

In [21]:
df.iloc[80:100,2:]

    PetalLength  PetalWidth          Species
80          4.4         1.2  Iris-versicolor
81          4.6         1.4  Iris-versicolor
82          4.0         1.2  Iris-versicolor
83          3.3         1.0  Iris-versicolor
84          4.2         1.3  Iris-versicolor
85          4.2         1.2  Iris-versicolor
86          4.2         1.3  Iris-versicolor
87          4.3         1.3  Iris-versicolor
88          3.0         1.1  Iris-versicolor
89          4.1         1.3  Iris-versicolor
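
iloc always works with integer positions. Its label-based sibling is loc, and the two treat the end of a slice differently. The sketch below contrasts them on a made-up frame with letter labels (demo and its index are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# iloc: positions 1 and 2; the end position 3 is excluded.
by_position = demo.iloc[1:3]

# loc: labels "x" through "z"; the end label is included.
by_label = demo.loc["x":"z", ["a"]]

print(by_position)
print(by_label)
```

With the default 0, 1, 2, ... index used by the iris dataframe, labels and positions coincide, which is why the difference is easy to overlook.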


To find the unique values in a particular column and how many times each occurs, use value_counts:

In [22]:
df['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64
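
If you only need the number of distinct values, or the count for one value, nunique and indexing into the result of value_counts are handy. A minimal sketch on made-up labels (species is a hypothetical Series):

```python
import pandas as pd

species = pd.Series(["setosa", "virginica", "setosa", "versicolor", "setosa"])

counts = species.value_counts()   # sorted by count, largest first
n_unique = species.nunique()      # number of distinct values

print(counts)
print(n_unique)
print(counts["setosa"])           # count for one particular value
```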

Sometimes you want all rows in which a column has a particular value. For example, to get all rows whose Species is Iris-setosa:

In [23]:
df[df['Species']=='Iris-setosa']

   SepalLength  SepalWidth  PetalLength  PetalWidth      Species
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa
5          4.6         3.4          1.4         0.3  Iris-setosa
6          5.0         3.4          1.5         0.2  Iris-setosa
7          4.4         2.9          1.4         0.2  Iris-setosa
8          5.4         3.7          1.5         0.2  Iris-setosa
9          4.8         3.4          1.6         0.2  Iris-setosa
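
It can help to look at the two steps separately: the comparison produces a Series of True/False values (a boolean mask), and indexing with that mask keeps the rows marked True. A sketch on a made-up three-row frame (demo and its values are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "length": [5.1, 6.5, 4.9],
    "species": ["setosa", "virginica", "setosa"],
})

# Step 1: the comparison yields one boolean per row.
mask = demo["species"] == "setosa"
print(mask.tolist())

# Step 2: indexing with the mask keeps only the True rows.
subset = demo[mask]
print(subset)

# Summing a mask counts the True values, i.e. the matching rows.
n_match = int(mask.sum())
print(n_match)
```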


Select rows in which a certain column falls within a range, here a sepal length below 5:

In [24]:
df[df['SepalLength']<5]

    SepalLength  SepalWidth  PetalLength  PetalWidth      Species
1           4.9         3.0          1.4         0.2  Iris-setosa
2           4.6         3.1          1.5         0.2  Iris-setosa
5           4.6         3.4          1.4         0.3  Iris-setosa
7           4.4         2.9          1.4         0.2  Iris-setosa
9           4.8         3.4          1.6         0.2  Iris-setosa
10          4.8         3.0          1.4         0.1  Iris-setosa
11          4.3         3.0          1.1         0.1  Iris-setosa
19          4.6         3.6          1.0         0.2  Iris-setosa
21          4.8         3.4          1.9         0.2  Iris-setosa
26          4.7         3.2          1.6         0.2  Iris-setosa
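
For a range with both a lower and an upper bound, conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than the comparisons. A sketch on made-up data (demo is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    "length": [4.4, 5.0, 5.6, 6.2],
    "species": ["a", "a", "b", "b"],
})

# Both conditions must hold: 5.0 <= length < 6.0.
in_range = demo[(demo["length"] >= 5.0) & (demo["length"] < 6.0)]

# between() includes both end points by default.
closed = demo[demo["length"].between(5.0, 5.6)]

# Either condition may hold.
either = demo[(demo["species"] == "a") | (demo["length"] > 6.0)]

print(in_range)
print(closed)
print(either)
```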


To sort the rows in a dataframe according to the values in a certain column, use <code>sort_values</code>.

In [25]:
df.sort_values(by=['SepalWidth'])

     SepalLength  SepalWidth  PetalLength  PetalWidth          Species
53           5.0         2.0          3.5         1.0  Iris-versicolor
147          6.0         2.2          5.0         1.5   Iris-virginica
55           6.0         2.2          4.0         1.0  Iris-versicolor
61           6.2         2.2          4.5         1.5  Iris-versicolor
48           5.5         2.3          4.0         1.3  Iris-versicolor
..           ...         ...          ...         ...              ...
13           5.4         3.9          1.3         0.4      Iris-setosa
137          5.8         4.0          1.2         0.2      Iris-setosa
29           5.2         4.1          1.5         0.1      Iris-setosa
30           5.5         4.2          1.4         0.2      Iris-setosa
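
sort_values sorts in ascending order unless told otherwise, and it can break ties with a second column. Like most pandas methods it returns a new dataframe rather than changing the original. A sketch on a made-up frame (demo and its column names are assumptions):

```python
import pandas as pd

demo = pd.DataFrame({"width": [3.0, 2.0, 3.0], "length": [5.0, 6.0, 4.0]})

# Largest width first.
desc = demo.sort_values(by="width", ascending=False)

# Sort by width, then by length within equal widths.
multi = demo.sort_values(by=["width", "length"])

print(desc)
print(multi)
print(demo)   # the original row order is untouched
```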


To drop a column or row, use drop, with <code>axis=0</code> meaning rows and <code>axis=1</code> meaning columns. Note that drop is not in place by default: it returns a new dataframe and leaves the original unchanged.

In [26]:
df1 = df.drop(["Species"],axis=1)
df1.head()

   SepalLength  SepalWidth  PetalLength  PetalWidth
0          5.1         3.5          1.4         0.2
1          4.9         3.0          1.4         0.2
2          4.6         3.1          1.5         0.2
3          5.0         3.6          1.4         0.2
4          5.4         3.9          1.7         0.4
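
The same method removes rows when given index labels, and the columns keyword is an alternative to axis=1. The sketch below, on a made-up frame (demo is hypothetical), also checks that the original is left alone:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Drop a column; columns=[...] is equivalent to passing axis=1.
no_b = demo.drop(columns=["b"])

# Drop the row whose index label is 0.
no_first = demo.drop([0], axis=0)

print(no_b)
print(no_first)
print(list(demo.columns))   # demo still has both columns
```

To keep the result, assign it to a name, as df1 = df.drop(...) does above.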

Pandas offers much more than what is shown here. Read the documentation for more useful functionality!

Exercise: <br>
Read the "auto-mpg.csv" file into a Pandas dataframe. It contains data about car models. 
Find the maximum, minimum, and average value of MPG of all cars.
Find the numbers of cars that have 4, 6, or 8 cylinders.
Find the number of cars with less than 100 horsepower.
Find the average mpg of cars that have 6 cylinders.