In [1]:
import pandas as pd
import numpy as np

# Loading a CSV

In [2]:
df = pd.read_csv('/data/datasets/auto-mpg.csv', delimiter=',')

In [13]:
df.mpg + df.acceleration ** 2 

0      162.00
1      147.25
2      139.00
3      160.00
4      127.25
5      115.00
6       95.00
7       86.25
8      114.00
9       87.25
10     115.00
11      78.00
12     105.25
13     114.00
14     249.00
15     262.25
16     258.25
17     277.00
18     237.25
19     446.25
20     331.25
21     234.25
22     331.25
23     182.25
24     246.00
25     206.00
26     235.00
27     193.25
28     351.25
29     237.25
        ...  
368    372.96
369    358.00
370    293.44
371    285.00
372    351.00
373    292.96
374    443.25
375    270.09
376    368.24
377    340.76
378    254.09
379    335.29
380    246.25
381    246.25
382    319.61
383    263.00
384    278.49
385    300.44
386    293.96
387    327.00
388    236.25
389    238.09
390    225.21
391    205.00
392    326.29
393    270.36
394    649.16
395    166.56
396    373.96
397    407.36
Length: 398, dtype: float64

# Dataframe

A dataframe in pandas is like a database table or an Excel sheet. The data is structured in columns, each of which holds values of a single datatype. The most common datatypes are int, float and text. The rows are indexed by a row number. As in a database table, every row is an entry that has values for any or all of the columns.

To avoid looking at too many rows, we can see the first or last rows by using the `head` or `tail` method on a Dataframe.

In [3]:
df[['mpg', 'horsepower', 'weight', 'car name']].head()

Unnamed: 0,mpg,horsepower,weight,car name
0,18.0,130,3504,chevrolet chevelle malibu
1,15.0,165,3693,buick skylark 320
2,18.0,150,3436,plymouth satellite
3,16.0,150,3433,amc rebel sst
4,17.0,140,3449,ford torino
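
Both `head` and `tail` accept an optional row count, which defaults to 5. A minimal sketch on a made-up ten-row frame (the name `sample` and its column `n` are hypothetical, chosen only so the snippet runs without the CSV file):

```python
import pandas as pd

# Hypothetical stand-in for df: ten rows numbered 0..9.
sample = pd.DataFrame({'n': range(10)})

first_five = sample.head()    # no argument: the first 5 rows
first_three = sample.head(3)  # the first 3 rows
last_two = sample.tail(2)     # the last 2 rows

print(first_three)
print(last_two)
```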


When a table contains too many rows and/or columns, Pandas usually truncates the output. To view all the names of the columns in a Dataframe, we can use the `columns` property, and if that is also truncated we may use `columns.tolist()`.

In [4]:
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin', 'car name'],
      dtype='object')
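
As a small illustration of `columns.tolist()`, the sketch below builds a two-row stand-in for `df` (the name `sample` is hypothetical; the values are the first two rows shown earlier) and converts its column index into a plain Python list, which is printed in full:

```python
import pandas as pd

# Hypothetical two-row stand-in for df, using three of its column names.
sample = pd.DataFrame({
    'mpg': [18.0, 15.0],
    'weight': [3504, 3693],
    'car name': ['chevrolet chevelle malibu', 'buick skylark 320'],
})

names = sample.columns.tolist()  # plain Python list of column labels
print(names)
```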

# Slicing

Slicing a Dataframe works similarly to slicing a Python list. We put a range between `[]`, written as `[start:end:step]`, in which start, end and step are all optional. Note that a single number does not select a row here: `df[0]` is read as a column label, not as a row number.

In [5]:
df[:2]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
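
The optional start, end and step behave as they do for a Python list. A short sketch, assuming a five-row stand-in for `df` (the name `sample` is hypothetical; the `mpg` values are the first five shown earlier):

```python
import pandas as pd

# Hypothetical five-row stand-in for df.
sample = pd.DataFrame({'mpg': [18.0, 15.0, 18.0, 16.0, 17.0]})

middle = sample[1:4]       # rows 1, 2 and 3; the end is exclusive
every_other = sample[::2]  # rows 0, 2 and 4
last_two = sample[-2:]     # rows 3 and 4

print(middle)
print(every_other)
print(last_two)
```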


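When we index with labels instead of a range, Pandas uses the labels to select columns. A single label returns one column as a Series; a list of labels returns a Dataframe with those columns. A column whose name is a valid Python identifier can also be read as an attribute, such as `df.mpg`.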

In [6]:
df['mpg'].head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

In [7]:
df[['mpg', 'weight']].head()

Unnamed: 0,mpg,weight
0,18.0,3504
1,15.0,3693
2,18.0,3436
3,16.0,3433
4,17.0,3449


In [8]:
df.mpg.head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64
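
The three forms used above differ in what they return, which can be checked directly. A minimal sketch on a hypothetical stand-in for `df` (the name `sample` is an assumption; the values come from the rows shown earlier):

```python
import pandas as pd

# Hypothetical three-row stand-in for df.
sample = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'weight': [3504, 3693, 3436],
})

one = sample['mpg']               # single label: a Series
several = sample[['mpg']]         # list of labels: a Dataframe, even with one label
attribute = sample.mpg            # attribute access: the same Series as sample['mpg']

print(type(one).__name__)
print(type(several).__name__)
print(attribute.equals(one))
```

Attribute access only works for names that are valid Python identifiers, so a column such as `car name` still needs the bracket form.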

You can combine slicing of rows and columns by using one after the other.

In [9]:
df.mpg[:2]

0    18.0
1    15.0
Name: mpg, dtype: float64

In [10]:
df[:5][['mpg', 'weight']]

Unnamed: 0,mpg,weight
0,18.0,3504
1,15.0,3693
2,18.0,3436
3,16.0,3433
4,17.0,3449
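
The order of the two steps does not change the result for a plain selection like this one. A small sketch, again on a hypothetical stand-in named `sample`:

```python
import pandas as pd

# Hypothetical four-row stand-in for df.
sample = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'horsepower': [130, 165, 150, 150],
    'weight': [3504, 3693, 3436, 3433],
})

rows_then_cols = sample[:2][['mpg', 'weight']]  # slice rows, then pick columns
cols_then_rows = sample[['mpg', 'weight']][:2]  # pick columns, then slice rows

same = rows_then_cols.equals(cols_then_rows)
print(same)
```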


When you want to select specific rows, you can use `.iloc` for positional indexing. Note that the syntax is a bit different: the row positions are passed as a list inside the square brackets.

In [11]:
df.iloc[[1,3]]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
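
Besides a list of positions, `.iloc` also accepts a single position or a positional slice. A short sketch on a hypothetical stand-in for `df` (the name `sample` is an assumption; the values are the first four rows shown earlier):

```python
import pandas as pd

# Hypothetical four-row stand-in for df.
sample = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0, 16.0],
    'weight': [3504, 3693, 3436, 3433],
})

picked = sample.iloc[[1, 3]]  # list of positions: rows 1 and 3 as a Dataframe
single = sample.iloc[0]       # one position: that row as a Series
block = sample.iloc[1:3]      # positional slice: rows 1 and 2

print(picked)
print(single)
print(block)
```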


Alternatively, you can also create a new Dataframe by dropping columns with `drop`. By default, this returns a new Dataframe without the dropped columns and leaves the original untouched. The new and the old Dataframe can coexist, meaning that `drop` is not destructive on the original Dataframe.

In [12]:
df.drop(columns=['mpg', 'weight']).head()

Unnamed: 0,cylinders,displacement,horsepower,acceleration,model year,origin,car name
0,8,307.0,130,12.0,70,1,chevrolet chevelle malibu
1,8,350.0,165,11.5,70,1,buick skylark 320
2,8,318.0,150,11.0,70,1,plymouth satellite
3,8,304.0,150,12.0,70,1,amc rebel sst
4,8,302.0,140,10.5,70,1,ford torino


In [13]:
newdf = df.drop(columns='mpg')
newdf.head()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,8,302.0,140,3449,10.5,70,1,ford torino
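
That `drop` leaves the original intact can be checked by inspecting the columns of both Dataframes afterwards. A minimal sketch, assuming a two-row stand-in for `df` (the names `sample` and `trimmed` are hypothetical):

```python
import pandas as pd

# Hypothetical two-row stand-in for df.
sample = pd.DataFrame({
    'mpg': [18.0, 15.0],
    'weight': [3504, 3693],
})

trimmed = sample.drop(columns='mpg')  # new Dataframe without the mpg column

print(trimmed.columns.tolist())  # columns of the result
print(sample.columns.tolist())   # columns of the original, unchanged
```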
