-
Notifications
You must be signed in to change notification settings - Fork 2
manipulate dataset with pandas
Pandas is a python library for data manipulation
>>> import pandas as pd
>>> bear_family = [[100, 5 , 20, 80], [50 , 2.5, 10, 40], [110, 6 , 22, 80]]
>>> bear_family
[[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]]
>>> type(bear_family)
<class 'list'>
>>>
>>> bear_family_df = pd.DataFrame(bear_family)
>>> type(bear_family_df)
<class 'pandas.core.frame.DataFrame'>
>>> bear_family_df
0 1 2 3
0 100 5.0 20 80
1 50 2.5 10 40
2 110 6.0 22 80
>>> bear_family_df = pd.DataFrame(bear_family, index = ['mom', 'baby', 'dad'], columns = ['leg', 'hair','tail', 'belly'])
>>> bear_family_df
leg hair tail belly
mom 100 5.0 20 80
baby 50 2.5 10 40
dad 110 6.0 22 80
>>> bear_family_df.leg
mom 100
baby 50
dad 110
Name: leg, dtype: int64
>>> bear_family_df["leg"]
mom 100
baby 50
dad 110
Name: leg, dtype: int64
>>> bear_family_df["leg"].values
array([100, 50, 110])
by his position (2)
>>> bear_family_df.iloc[2]
leg 110.0
hair 6.0
tail 22.0
belly 80.0
Name: dad, dtype: float64
by his name
>>> bear_family_df.loc["dad"]
leg 110.0
hair 6.0
tail 22.0
belly 80.0
Name: dad, dtype: float64
find out which bear has a leg of 110:
>>> bear_family_df["leg"] == 110
mom False
baby False
dad True
Name: leg, dtype: bool
select the bears that have a leg of 110
>>> bears_110 = bear_family_df[bear_family_df["leg"] == 110]
>>> bears_110
leg hair tail belly
dad 110 6.0 22 80
select the bears that have a belly size of 80
>>> mask = bear_family_df["belly"] == 80
>>> bears_80 = bear_family_df[mask]
>>> bears_80
leg hair tail belly
mom 100 5.0 20 80
dad 110 6.0 22 80
use the operator ~
to select the bears that don't have a belly size of 80
>>> bear_family_df[~mask]
leg hair tail belly
baby 50 2.5 10 40
>>> some_bears = pd.DataFrame([[105,4,19,80],[100,5,20,80]], columns = bear_family_df.columns)
>>> some_bears
leg hair tail belly
0 105 4 19 80
1 100 5 20 80
>>> all_bears = bear_family_df.append(some_bears)
>>> all_bears
leg hair tail belly
mom 100 5.0 20 80
baby 50 2.5 10 40
dad 110 6.0 22 80
0 105 4.0 19 80
1 100 5.0 20 80
In the DataFrame all_bears, the first bear (mom) and the last bear have exactly the same measurements.
>>> all_bears = all_bears.drop_duplicates()
>>> all_bears
leg hair tail belly
mom 100 5.0 20 80
baby 50 2.5 10 40
dad 110 6.0 22 80
0 105 4.0 19 80
>>> bear_family_df.columns
Index(['leg', 'hair', 'tail', 'belly'], dtype='object')
mom and baby are female, dad is male
>>> bear_family_df["sex"] = ["f", "f", "m"]
>>> bear_family_df
leg hair tail belly sex
mom 100 5.0 20 80 f
baby 50 2.5 10 40 f
dad 110 6.0 22 80 m
>>> len(bear_family_df)
3
>>> bear_family_df.belly.unique()
array([80, 40])
>>> import os
>>> os.getcwd()
'/home/ksator'
>>> data = pd.read_csv("seaborn-data/iris.csv", sep=",")
>>> import seaborn as sns
>>> titanic = sns.load_dataset('titanic')
>>> type(titanic)
<class 'pandas.core.frame.DataFrame'>
>>> titanic.head(5)
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
>>> titanic.age.head(5)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: age, dtype: float64
>>> titanic.tail(5)
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
886 0 2 male 27.0 0 0 13.00 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.00 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.45 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.00 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.75 Q Third man True NaN Queenstown no True
>>> titanic.age.tail(5)
886 27.0
887 19.0
888 NaN
889 26.0
890 32.0
Name: age, dtype: float64
>>> titanic.age.unique()
array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
>>> titanic['age'].describe(include="all")
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: age, dtype: float64
>>> titanic.describe(include="all")
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
count 891.000000 891.000000 891 714.000000 891.000000 891.000000 891.000000 889 891 891 891 203 889 891 891
unique NaN NaN 2 NaN NaN NaN NaN 3 3 3 2 7 3 2 2
top NaN NaN male NaN NaN NaN NaN S Third man True C Southampton no True
freq NaN NaN 577 NaN NaN NaN NaN 644 491 537 537 59 644 549 537
mean 0.383838 2.308642 NaN 29.699118 0.523008 0.381594 32.204208 NaN NaN NaN NaN NaN NaN NaN NaN
std 0.486592 0.836071 NaN 14.526497 1.102743 0.806057 49.693429 NaN NaN NaN NaN NaN NaN NaN NaN
min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
25% 0.000000 2.000000 NaN 20.125000 0.000000 0.000000 7.910400 NaN NaN NaN NaN NaN NaN NaN NaN
50% 0.000000 3.000000 NaN 28.000000 0.000000 0.000000 14.454200 NaN NaN NaN NaN NaN NaN NaN NaN
75% 1.000000 3.000000 NaN 38.000000 1.000000 0.000000 31.000000 NaN NaN NaN NaN NaN NaN NaN NaN
max 1.000000 3.000000 NaN 80.000000 8.000000 6.000000 512.329200 NaN NaN NaN NaN NaN NaN NaN NaN
>>> titanic.age.head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: age, dtype: float64
This returns a DataFrame where all NaN in the age column have been replaced by 0.
>>> titanic.fillna(value={"age": 0}).age.head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 0.0
6 54.0
7 2.0
8 27.0
9 14.0
Name: age, dtype: float64
The above example doesnt change the dataset.
To change it run titanic['age'] = titanic.fillna(value={'age': 0}).age
or titanic.fillna(value={'age': 0}, inplace=True)
This returns a DataFrame where all NaN in the age column have been replaced with the previous values
>>> titanic.fillna(method="pad").age.head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 35.0
6 54.0
7 2.0
8 27.0
9 14.0
Name: age, dtype: float64
The above example doesnt change the dataset.
To change it run titanic['age'] = titanic.fillna(method="pad").age
or titanic.fillna(method="pad", inplace=True)
By default, it deletes the lines that contains NaN.
>>> titanic['age'].dropna().head(10)
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
Name: age, dtype: float64
>>> titanic.dropna().head(10)
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
6 0 1 male 54.0 0 0 51.8625 S First man True E Southampton no True
10 1 3 female 4.0 1 1 16.7000 S Third child False G Southampton yes False
11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True
21 1 2 male 34.0 0 0 13.0000 S Second man True D Southampton yes True
23 1 1 male 28.0 0 0 35.5000 S First man True A Southampton yes True
27 0 1 male 19.0 3 2 263.0000 S First man True C Southampton no False
52 1 1 female 49.0 1 0 76.7292 C First woman False D Cherbourg yes False
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
The above examples dont change the dataset. They simply return the lines that doesnt contain NaN
To change the column age (i.e to remove all the lines that have a NaN), run titanic['age'].dropna(inplace=True)
>>> titanic.dropna(axis="columns").head()
survived pclass sex sibsp parch fare class who adult_male alive alone
0 0 3 male 1 0 7.2500 Third man True no False
1 1 1 female 1 0 71.2833 First woman False yes False
2 1 3 female 0 0 7.9250 Third woman False yes True
3 1 1 female 1 0 53.1000 First woman False yes False
4 0 3 male 0 0 8.0500 Third man True no True
The above example doesnt change the dataset. It simplies return the colums that doesnt contain NaN
To change it run (i.e to remove the column that contain NaN) titanic.dropna(axis="columns", inplace=True)
>>> titanic.rename(columns={"sex":"gender"}).head(5)
survived pclass gender age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
The above example doesnt change the dataset.
To change it, run titanic.rename(columns={"sex":"gender"},inplace=True)
>>> titanic.drop(0)
The above example doesnt change the dataset. It simplies return the dataset without the line line with an index equal to 0.
To change the dataset run titanic = titanic.drop(0)
or titanic.drop(0, inplace=True)
>>> titanic.drop(columns=["age"])
The above example doesnt change the dataset. It simplies return the dataset without the column "age".
To change the dataset run titanic = titanic.drop(columns=["age"])
the column survived uses 0s and 1s (0 means died and 1 means survived)
the result is an average
so 50% of females in third class died
>>> titanic.pivot_table('survived', index='sex', columns='class')
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
get the total number of survivors in each case
the column survived uses 0s and 1s
lets use the sum function
>>> titanic.pivot_table('survived', index='sex', columns='class', aggfunc="sum")
class First Second Third
sex
female 91 70 72
male 45 17 47
remove the lines with NaN
group the ages into three categories
use the cut function to segment data values
>>> titanic.dropna(inplace=True)
>>> age = pd.cut(titanic['age'], [0, 18, 80])
>>> titanic.pivot_table('survived', ['sex', age], 'class')
class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.500000
(18, 80] 0.968254 0.875000 0.666667
male (0, 18] 0.800000 1.000000 1.000000
(18, 80] 0.397436 0.333333 0.250000