Skip to content

manipulate dataset with pandas

Khelil Sator edited this page Jun 30, 2019 · 18 revisions

Pandas is a python library for data manipulation

import the pandas library

>>> import pandas as pd

create a list

>>> bear_family = [[100, 5  , 20, 80], [50 , 2.5, 10, 40], [110, 6  , 22, 80]]
>>> bear_family
[[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]]
>>> type(bear_family)
<class 'list'>
>>> 

instanciate the DataFrame class

>>> bear_family_df = pd.DataFrame(bear_family)
>>> type(bear_family_df)
<class 'pandas.core.frame.DataFrame'>
>>> bear_family_df
     0    1   2   3
0  100  5.0  20  80
1   50  2.5  10  40
2  110  6.0  22  80

specify column and row names

>>> bear_family_df = pd.DataFrame(bear_family, index = ['mom', 'baby', 'dad'], columns = ['leg', 'hair','tail', 'belly'])
>>> bear_family_df
      leg  hair  tail  belly
mom   100   5.0    20     80
baby   50   2.5    10     40
dad   110   6.0    22     80

access the leg column of the table

>>> bear_family_df.leg
mom     100
baby     50
dad     110
Name: leg, dtype: int64
>>> bear_family_df["leg"]
mom     100
baby     50
dad     110
Name: leg, dtype: int64
>>> bear_family_df["leg"].values
array([100,  50, 110])

Let's now access dad bear

by his position (2)

>>> bear_family_df.iloc[2]
leg      110.0
hair       6.0
tail      22.0
belly     80.0
Name: dad, dtype: float64

by his name

>>> bear_family_df.loc["dad"]
leg      110.0
hair       6.0
tail      22.0
belly     80.0
Name: dad, dtype: float64

filter lines

find out which bear has a leg of 110:

>>> bear_family_df["leg"] == 110
mom     False
baby    False
dad      True
Name: leg, dtype: bool

select the bears that have a leg of 110

>>> bears_110 = bear_family_df[bear_family_df["leg"] == 110]
>>> bears_110
     leg  hair  tail  belly
dad  110   6.0    22     80 

select the bears that have a belly size of 80

>>> mask = bear_family_df["belly"] == 80
>>> bears_80 = bear_family_df[mask]
>>> bears_80
     leg  hair  tail  belly
mom  100   5.0    20     80
dad  110   6.0    22     80

use the operator ~ to select the bears that don't have a belly size of 80

>>> bear_family_df[~mask]
      leg  hair  tail  belly
baby   50   2.5    10     40

create a new dataframe with 2 new bears. use the same columns as bear_family_df

>>> some_bears = pd.DataFrame([[105,4,19,80],[100,5,20,80]], columns = bear_family_df.columns) 
>>> some_bears
   leg  hair  tail  belly
0  105     4    19     80
1  100     5    20     80

assemble the two DataFrames together

>>> all_bears = bear_family_df.append(some_bears)
>>> all_bears
      leg  hair  tail  belly
mom   100   5.0    20     80
baby   50   2.5    10     40
dad   110   6.0    22     80
0     105   4.0    19     80
1     100   5.0    20     80

drop duplicates

In the DataFrame all_bears, the first bear (mom) and the last bear have exactly the same measurements.

>>> all_bears = all_bears.drop_duplicates()
>>> all_bears
      leg  hair  tail  belly
mom   100   5.0    20     80
baby   50   2.5    10     40
dad   110   6.0    22     80
0     105   4.0    19     80

get names of columns

>>> bear_family_df.columns
Index(['leg', 'hair', 'tail', 'belly'], dtype='object')

Add a new column to a DataFrame

mom and baby are female, dad is male

>>> bear_family_df["sex"] = ["f", "f", "m"]
>>> bear_family_df
      leg  hair  tail  belly sex
mom   100   5.0    20     80   f
baby   50   2.5    10     40   f
dad   110   6.0    22     80   m

get the number of items

>>> len(bear_family_df)
3

get the distinct values for a columns

>>> bear_family_df.belly.unique()
array([80, 40])

read a csv file with Pandas

>>> import os
>>> os.getcwd()
'/home/ksator'
>>> data = pd.read_csv("seaborn-data/iris.csv", sep=",")

load the titanic dataset from seaborn

>>> import seaborn as sns
>>> titanic = sns.load_dataset('titanic')
>>> type(titanic)
<class 'pandas.core.frame.DataFrame'>

displays the first elements of the DataFrame

>>> titanic.head(5)
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
>>> titanic.age.head(5)
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64

displays the latest elements of the DataFrame.

>>> titanic.tail(5)
     survived  pclass     sex   age  sibsp  parch   fare embarked   class    who  adult_male deck  embark_town alive  alone
886         0       2    male  27.0      0      0  13.00        S  Second    man        True  NaN  Southampton    no   True
887         1       1  female  19.0      0      0  30.00        S   First  woman       False    B  Southampton   yes   True
888         0       3  female   NaN      1      2  23.45        S   Third  woman       False  NaN  Southampton    no  False
889         1       1    male  26.0      0      0  30.00        C   First    man        True    C    Cherbourg   yes   True
890         0       3    male  32.0      0      0   7.75        Q   Third    man        True  NaN   Queenstown    no   True
>>> titanic.age.tail(5)
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, dtype: float64

returns the unique values present in a Pandas data structure.

>>> titanic.age.unique()
array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

The method describe provides various statistics (average, maximum, minimum, etc.)

>>> titanic['age'].describe(include="all")
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

>>> titanic.describe(include="all")
          survived      pclass   sex         age       sibsp       parch        fare embarked  class  who adult_male deck  embark_town alive alone
count   891.000000  891.000000   891  714.000000  891.000000  891.000000  891.000000      889    891  891        891  203          889   891   891
unique         NaN         NaN     2         NaN         NaN         NaN         NaN        3      3    3          2    7            3     2     2
top            NaN         NaN  male         NaN         NaN         NaN         NaN        S  Third  man       True    C  Southampton    no  True
freq           NaN         NaN   577         NaN         NaN         NaN         NaN      644    491  537        537   59          644   549   537
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   32.204208      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   49.693429      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000    0.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000    7.910400      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   14.454200      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   31.000000      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000  512.329200      NaN    NaN  NaN        NaN  NaN          NaN   NaN   NaN

NaN stands for Not a Number

>>> titanic.age.head(10)
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64

use the fillna method to replace NaN with other values

This returns a DataFrame where all NaN in the age column have been replaced by 0.

>>> titanic.fillna(value={"age": 0}).age.head(10)
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     0.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64

The above example doesnt change the dataset.
To change it run titanic['age'] = titanic.fillna(value={'age': 0}).age or titanic.fillna(value={'age': 0}, inplace=True)

use the fillna method to replace NaN with other values

This returns a DataFrame where all NaN in the age column have been replaced with the previous values

>>> titanic.fillna(method="pad").age.head(10)
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    35.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64

The above example doesnt change the dataset.
To change it run titanic['age'] = titanic.fillna(method="pad").age or titanic.fillna(method="pad", inplace=True)

use the dropna method to delete columns or rows/lines that contain NaN

By default, it deletes the lines that contains NaN.

>>> titanic['age'].dropna().head(10)
0     22.0
1     38.0
2     26.0
3     35.0
4     35.0
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
Name: age, dtype: float64
>>> titanic.dropna().head(10)
    survived  pclass     sex   age  sibsp  parch      fare embarked   class    who  adult_male deck  embark_town alive  alone
1          1       1  female  38.0      1      0   71.2833        C   First  woman       False    C    Cherbourg   yes  False
3          1       1  female  35.0      1      0   53.1000        S   First  woman       False    C  Southampton   yes  False
6          0       1    male  54.0      0      0   51.8625        S   First    man        True    E  Southampton    no   True
10         1       3  female   4.0      1      1   16.7000        S   Third  child       False    G  Southampton   yes  False
11         1       1  female  58.0      0      0   26.5500        S   First  woman       False    C  Southampton   yes   True
21         1       2    male  34.0      0      0   13.0000        S  Second    man        True    D  Southampton   yes   True
23         1       1    male  28.0      0      0   35.5000        S   First    man        True    A  Southampton   yes   True
27         0       1    male  19.0      3      2  263.0000        S   First    man        True    C  Southampton    no  False
52         1       1  female  49.0      1      0   76.7292        C   First  woman       False    D    Cherbourg   yes  False
54         0       1    male  65.0      0      1   61.9792        C   First    man        True    B    Cherbourg    no  False

The above examples dont change the dataset. They simply return the lines that doesnt contain NaN
To change the column age (i.e to remove all the lines that have a NaN), run titanic['age'].dropna(inplace=True)

delete the columns that contain NaN

>>> titanic.dropna(axis="columns").head()
   survived  pclass     sex  sibsp  parch     fare  class    who  adult_male alive  alone
0         0       3    male      1      0   7.2500  Third    man        True    no  False
1         1       1  female      1      0  71.2833  First  woman       False   yes  False
2         1       3  female      0      0   7.9250  Third  woman       False   yes   True
3         1       1  female      1      0  53.1000  First  woman       False   yes  False
4         0       3    male      0      0   8.0500  Third    man        True    no   True

The above example doesnt change the dataset. It simplies return the colums that doesnt contain NaN
To change it run (i.e to remove the column that contain NaN) titanic.dropna(axis="columns", inplace=True)

rename a column

>>> titanic.rename(columns={"sex":"gender"}).head(5)
   survived  pclass  gender   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

The above example doesnt change the dataset.
To change it, run titanic.rename(columns={"sex":"gender"},inplace=True)

delete the line with an index equal to 0.

>>> titanic.drop(0)

The above example doesnt change the dataset. It simplies return the dataset without the line line with an index equal to 0.
To change the dataset run titanic = titanic.drop(0) or titanic.drop(0, inplace=True)

Deletes the column "age"

>>> titanic.drop(columns=["age"])

The above example doesnt change the dataset. It simplies return the dataset without the column "age".
To change the dataset run titanic = titanic.drop(columns=["age"])

see the distribution of survivors by gender and ticket type

the column survived uses 0s and 1s (0 means died and 1 means survived)
the result is an average
so 50% of females in third class died

>>> titanic.pivot_table('survived', index='sex', columns='class')
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

get the total number of survivors in each case
the column survived uses 0s and 1s
lets use the sum function

>>> titanic.pivot_table('survived', index='sex', columns='class', aggfunc="sum")
class   First  Second  Third
sex
female     91      70     72
male       45      17     47

remove the lines with NaN
group the ages into three categories
use the cut function to segment data values

>>> titanic.dropna(inplace=True)
>>> age = pd.cut(titanic['age'], [0, 18, 80])
>>> titanic.pivot_table('survived', ['sex', age], 'class')
class               First    Second     Third
sex    age
female (0, 18]   0.909091  1.000000  0.500000
       (18, 80]  0.968254  0.875000  0.666667
male   (0, 18]   0.800000  1.000000  1.000000
       (18, 80]  0.397436  0.333333  0.250000
Clone this wiki locally