## Pandas - basic operations on DataFrame objects

#### Author: Marian Witkowski marian.witkowski[at]gmail.com

#### This material may be used for teaching purposes provided that the author is credited.

Examples of the most common operations and data manipulation techniques for DataFrame objects in Pandas:
- loading data from CSV
- accessing columns
- indexing/selecting data
- filtering
- changing data types
- statistical information about the object
- transforming and sorting data
- grouping
- deduplication
- building a ranking
- filling NaN values (missing values)
- joining data

<img src='http://51.91.120.89/itm/p.png' border=0 />

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

## Loading data into a DataFrame

In [2]:
df = pd.read_csv("https://bit.ly/3io3vC6")

### Show the first records of the DataFrame

In [3]:
df.head(10)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106
5,Afghanistan,1977,14880372.0,Asia,38.438,786.11336
6,Afghanistan,1982,12881816.0,Asia,39.854,978.011439
7,Afghanistan,1987,13867957.0,Asia,40.822,852.395945
8,Afghanistan,1992,16317921.0,Asia,41.674,649.341395
9,Afghanistan,1997,22227415.0,Asia,41.763,635.341351


### Show the last records of the DataFrame

In [4]:
df.tail(10) # the argument is the number of records to show

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1694,Zimbabwe,1962,4277736.0,Africa,52.358,527.272182
1695,Zimbabwe,1967,4995432.0,Africa,53.995,569.795071
1696,Zimbabwe,1972,5861135.0,Africa,55.635,799.362176
1697,Zimbabwe,1977,6642107.0,Africa,57.674,685.587682
1698,Zimbabwe,1982,7636524.0,Africa,60.363,788.855041
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143.0,Africa,43.487,469.709298


### Selecting data from a DataFrame

In [5]:
# a single column
df["country"]

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [6]:
# a single column - dot notation (works only when the column name is a valid Python identifier)
df.country

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

In [7]:
# multiple columns
df[ ["country","continent","gdpPercap"] ]

Unnamed: 0,country,continent,gdpPercap
0,Afghanistan,Asia,779.445314
1,Afghanistan,Asia,820.853030
2,Afghanistan,Asia,853.100710
3,Afghanistan,Asia,836.197138
4,Afghanistan,Asia,739.981106
...,...,...,...
1699,Zimbabwe,Africa,706.157306
1700,Zimbabwe,Africa,693.420786
1701,Zimbabwe,Africa,792.449960
1702,Zimbabwe,Africa,672.038623


In [8]:
# selecting specific rows and columns
# rows with index labels 1 through 10 (inclusive) and the columns "country", "continent", "gdpPercap"
df.loc[ 1:10,  ["country","continent","gdpPercap"]  ]

Unnamed: 0,country,continent,gdpPercap
1,Afghanistan,Asia,820.85303
2,Afghanistan,Asia,853.10071
3,Afghanistan,Asia,836.197138
4,Afghanistan,Asia,739.981106
5,Afghanistan,Asia,786.11336
6,Afghanistan,Asia,978.011439
7,Afghanistan,Asia,852.395945
8,Afghanistan,Asia,649.341395
9,Afghanistan,Asia,635.341351
10,Afghanistan,Asia,726.734055
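
The difference between label-based and position-based slicing is easy to miss, so here is a minimal sketch on a small hand-built frame (the `sample` frame and its variable names are illustrative; the values are copied from the output above). With `.loc` the end label is included, while with `.iloc` the end position is excluded, as in ordinary Python slicing.

```python
import pandas as pd

# A few rows copied from the output above; "sample" is an illustrative name
sample = pd.DataFrame({
    "country": ["Afghanistan"] * 4,
    "year": [1952, 1957, 1962, 1967],
    "gdpPercap": [779.445314, 820.853030, 853.100710, 836.197138],
})

# .loc slices by index label: labels 1, 2 and 3 are returned (end included)
by_label = sample.loc[1:3, ["country", "gdpPercap"]]

# .iloc slices by position: positions 1 and 2 are returned (end excluded)
by_position = sample.iloc[1:3, [0, 2]]

print(len(by_label), len(by_position))
```

With the default integer index, labels and positions coincide, which is why the two accessors are easy to confuse.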


In [9]:
# the last 10 rows and the columns at positions 0, 2, 4
df.iloc[ -10: , [0,2,4] ]

Unnamed: 0,country,pop,lifeExp
1694,Zimbabwe,4277736.0,52.358
1695,Zimbabwe,4995432.0,53.995
1696,Zimbabwe,5861135.0,55.635
1697,Zimbabwe,6642107.0,57.674
1698,Zimbabwe,7636524.0,60.363
1699,Zimbabwe,9216418.0,62.351
1700,Zimbabwe,10704340.0,60.377
1701,Zimbabwe,11404948.0,46.809
1702,Zimbabwe,11926563.0,39.989
1703,Zimbabwe,12311143.0,43.487


In [10]:
# all rows and the last 2 columns
df.iloc[ : , -2:]

Unnamed: 0,lifeExp,gdpPercap
0,28.801,779.445314
1,30.332,820.853030
2,31.997,853.100710
3,34.020,836.197138
4,36.088,739.981106
...,...,...
1699,62.351,706.157306
1700,60.377,693.420786
1701,46.809,792.449960
1702,39.989,672.038623


In [11]:
# every 3rd row and every 2nd column
df.iloc[ ::3, ::2 ]

Unnamed: 0,country,pop,lifeExp
0,Afghanistan,8425333.0,28.801
3,Afghanistan,11537966.0,34.020
6,Afghanistan,12881816.0,39.854
9,Afghanistan,22227415.0,41.763
12,Albania,1282697.0,55.230
...,...,...,...
1689,Zambia,9417789.0,40.238
1692,Zimbabwe,3080907.0,48.451
1695,Zimbabwe,4995432.0,53.995
1698,Zimbabwe,7636524.0,60.363


### Getting the column names

In [12]:
df.columns

Index(['country', 'year', 'pop', 'continent', 'lifeExp', 'gdpPercap'], dtype='object')

### Getting the column data types

In [13]:
df.dtypes

country       object
year           int64
pop          float64
continent     object
lifeExp      float64
gdpPercap    float64
dtype: object

### Changing the type of a DataFrame column

In [14]:
df["year"] = df["year"].astype("uint16") # 16-bit unsigned integer
df["lifeExp"] = df["lifeExp"].astype("float32") # 32-bit float

### Summary statistics of the data in a DataFrame

In [15]:
df.describe()

Unnamed: 0,year,pop,lifeExp,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,29601210.0,59.474384,7215.327081
std,17.26533,106157900.0,12.917107,9857.454543
min,1952.0,60011.0,23.599001,241.165876
25%,1965.75,2793664.0,48.197999,1202.060309
50%,1979.5,7023596.0,60.7125,3531.846988
75%,1993.25,19585220.0,70.845501,9325.462346
max,2007.0,1318683000.0,82.602997,113523.1329


### Amount of RAM used by the DataFrame

In [16]:
df.memory_usage(deep=True)

Index           128
country      111288
year           3408
pop           13632
continent    107184
lifeExp        6816
gdpPercap     13632
dtype: int64
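
The `year` and `lifeExp` figures above reflect the earlier `astype` conversion. The sketch below illustrates the trade-off on a stand-alone Series (the `years` Series is built here only for illustration and has the same length as the dataset): downcasting an integer column shrinks it, but the values must fit in the target type, and `float32` stores fewer significant digits than `float64`.

```python
import numpy as np
import pandas as pd

# 1704 values: the 12 survey years repeated (illustrative data, same length as the dataset)
years = pd.Series(np.tile(np.arange(1952, 2008, 5), 142), dtype="int64")

# Guard before downcasting: every value must fit in the target type
limits = np.iinfo("uint16")
assert years.min() >= limits.min and years.max() <= limits.max

years_small = years.astype("uint16")

# 8 bytes per value versus 2 bytes per value
print(years.memory_usage(index=False), years_small.memory_usage(index=False))

# float32 cannot represent 28.801 as closely as float64, so the round trip changes the value
print(float(np.float32(28.801)) == 28.801)
```

This loss of precision is why life expectancy values such as 28.801 appear as 28.801001 in the outputs that follow the conversion.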

### Sorting data by column(s)

In [17]:
df.sort_values('year')

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314
528,France,1952,42459667.0,Europe,67.410004,7029.809327
540,Gabon,1952,420702.0,Africa,37.002998,4293.476475
1656,West Bank and Gaza,1952,1030585.0,Asia,43.160000,1515.592329
552,Gambia,1952,284320.0,Africa,30.000000,485.230659
...,...,...,...,...,...,...
1127,Niger,2007,12894865.0,Africa,56.867001,619.676892
1139,Nigeria,2007,135031164.0,Africa,46.859001,2013.977305
1151,Norway,2007,4627926.0,Europe,80.195999,49357.190170
1175,Pakistan,2007,169270617.0,Asia,65.483002,2605.947580


In [18]:
df.sort_values(['country','pop'], ascending=[True,False])

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
11,Afghanistan,2007,31889923.0,Asia,43.827999,974.580338
10,Afghanistan,2002,25268405.0,Asia,42.129002,726.734055
9,Afghanistan,1997,22227415.0,Asia,41.763000,635.341351
8,Afghanistan,1992,16317921.0,Asia,41.674000,649.341395
5,Afghanistan,1977,14880372.0,Asia,38.438000,786.113360
...,...,...,...,...,...,...
1696,Zimbabwe,1972,5861135.0,Africa,55.634998,799.362176
1695,Zimbabwe,1967,4995432.0,Africa,53.994999,569.795071
1694,Zimbabwe,1962,4277736.0,Africa,52.358002,527.272182
1693,Zimbabwe,1957,3646340.0,Africa,50.469002,518.764268


### Sorting data by index

In [19]:
df.sort_index(ascending=False)

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1703,Zimbabwe,2007,12311143.0,Africa,43.487000,469.709298
1702,Zimbabwe,2002,11926563.0,Africa,39.988998,672.038623
1701,Zimbabwe,1997,11404948.0,Africa,46.808998,792.449960
1700,Zimbabwe,1992,10704340.0,Africa,60.376999,693.420786
1699,Zimbabwe,1987,9216418.0,Africa,62.351002,706.157306
...,...,...,...,...,...,...
4,Afghanistan,1972,13079460.0,Asia,36.088001,739.981106
3,Afghanistan,1967,11537966.0,Asia,34.020000,836.197138
2,Afghanistan,1962,10267083.0,Asia,31.997000,853.100710
1,Afghanistan,1957,9240934.0,Asia,30.332001,820.853030


### Getting the rows with the largest values in a column

In [20]:
df.nlargest(5, 'pop')

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
299,China,2007,1318683000.0,Asia,72.960999,4959.114854
298,China,2002,1280400000.0,Asia,72.028,3119.280896
297,China,1997,1230075000.0,Asia,70.426003,2289.234136
296,China,1992,1164970000.0,Asia,68.690002,1655.784158
707,India,2007,1110396000.0,Asia,64.697998,2452.210407


### Getting the rows with the smallest values in a column


In [21]:
df.nsmallest(5, 'lifeExp')

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1292,Rwanda,1992,7290203.0,Africa,23.599001,737.068595
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314
552,Gambia,1952,284320.0,Africa,30.0,485.230659
36,Angola,1952,4232095.0,Africa,30.014999,3520.610273
1344,Sierra Leone,1952,2143249.0,Africa,30.330999,879.787736


### Renaming columns

In [22]:
# convert all column names to lowercase
df.columns = df.columns.str.lower()
df.head(3)

Unnamed: 0,country,year,pop,continent,lifeexp,gdppercap
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332001,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071


In [23]:
# using the dedicated method for renaming column(s)
df.rename(columns={"country":"Country", "gdppercap":"GDP_per_cap"}, inplace=True)
df.head(3)

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332001,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071


### Transforming the contents of a column

An additional column "pop_range" takes one of the values L, M, H depending on the value of the "pop" column, according to the rule:

- L for <50 million
- M for >=50 million and <100 million
- H for >=100 million



In [24]:
def apply_pop(x):
    if x<50_000_000:
        return "L"
    if x<100_000_000:
        return "M"
    return "H"

df["pop_range"] = df["pop"].apply(apply_pop)
df.head(3)

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314,L
1,Afghanistan,1957,9240934.0,Asia,30.332001,820.85303,L
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071,L


### Mapping values in a column

Map the "pop_range" column as follows: L becomes Low, M becomes Medium, H becomes High.

In [25]:
df.pop_range = df.pop_range.map({
    "L" : "Low", "M" : "Medium", "H" : "High"
})
df.head()

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,Afghanistan,1952,8425333.0,Asia,28.801001,779.445314,Low
1,Afghanistan,1957,9240934.0,Asia,30.332001,820.85303,Low
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071,Low
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138,Low
4,Afghanistan,1972,13079460.0,Asia,36.088001,739.981106,Low


### Dropping column(s)

In [26]:
df.drop(columns='pop') # a single column; returns a new DataFrame, df itself is unchanged

Unnamed: 0,Country,year,continent,lifeexp,GDP_per_cap,pop_range
0,Afghanistan,1952,Asia,28.801001,779.445314,Low
1,Afghanistan,1957,Asia,30.332001,820.853030,Low
2,Afghanistan,1962,Asia,31.997000,853.100710,Low
3,Afghanistan,1967,Asia,34.020000,836.197138,Low
4,Afghanistan,1972,Asia,36.088001,739.981106,Low
...,...,...,...,...,...,...
1699,Zimbabwe,1987,Africa,62.351002,706.157306,Low
1700,Zimbabwe,1992,Africa,60.376999,693.420786,Low
1701,Zimbabwe,1997,Africa,46.808998,792.449960,Low
1702,Zimbabwe,2002,Africa,39.988998,672.038623,Low


In [27]:
df.drop(columns=['pop','continent']) # multiple columns

Unnamed: 0,Country,year,lifeexp,GDP_per_cap,pop_range
0,Afghanistan,1952,28.801001,779.445314,Low
1,Afghanistan,1957,30.332001,820.853030,Low
2,Afghanistan,1962,31.997000,853.100710,Low
3,Afghanistan,1967,34.020000,836.197138,Low
4,Afghanistan,1972,36.088001,739.981106,Low
...,...,...,...,...,...
1699,Zimbabwe,1987,62.351002,706.157306,Low
1700,Zimbabwe,1992,60.376999,693.420786,Low
1701,Zimbabwe,1997,46.808998,792.449960,Low
1702,Zimbabwe,2002,39.988998,672.038623,Low


### Transforming an object-type column with string methods

In [28]:
df.Country = df.Country.str.upper() # .str - accessor that applies string methods to every element
df.head()

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,AFGHANISTAN,1952,8425333.0,Asia,28.801001,779.445314,Low
1,AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.85303,Low
2,AFGHANISTAN,1962,10267083.0,Asia,31.997,853.10071,Low
3,AFGHANISTAN,1967,11537966.0,Asia,34.02,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low


### Sampling random rows from a DataFrame

In [29]:
# by count - with the "n" parameter
df.sample(n=10, random_state=42)

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
1046,MYANMAR,1962,23634436.0,Asia,45.108002,388.0,Low
745,IRELAND,1957,2878220.0,Europe,68.900002,5599.077872,Low
785,JAMAICA,1977,2156814.0,Americas,70.110001,6650.195573,Low
367,COTE D'IVOIRE,1987,10761098.0,Africa,54.654999,2156.956069,Low
1029,MOROCCO,1997,28529501.0,Africa,67.660004,2982.101858,Low
1648,VIETNAM,1972,44655014.0,Asia,50.254002,699.501644,Low
259,CENTRAL AFRICAN REPUBLIC,1987,2840009.0,Africa,50.485001,844.87635,Low
1509,TAIWAN,1997,21628605.0,Asia,75.25,20206.82098,Low
514,ETHIOPIA,2002,67946797.0,Africa,50.724998,530.053532,Medium
1229,POLAND,1977,34621254.0,Europe,70.669998,9508.141454,Low


In [30]:
# by fraction - with the "frac" parameter (here 0.5% of the rows)
df.sample(frac=0.5/100, random_state=42)

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
1046,MYANMAR,1962,23634436.0,Asia,45.108002,388.0,Low
745,IRELAND,1957,2878220.0,Europe,68.900002,5599.077872,Low
785,JAMAICA,1977,2156814.0,Americas,70.110001,6650.195573,Low
367,COTE D'IVOIRE,1987,10761098.0,Africa,54.654999,2156.956069,Low
1029,MOROCCO,1997,28529501.0,Africa,67.660004,2982.101858,Low
1648,VIETNAM,1972,44655014.0,Asia,50.254002,699.501644,Low
259,CENTRAL AFRICAN REPUBLIC,1987,2840009.0,Africa,50.485001,844.87635,Low
1509,TAIWAN,1997,21628605.0,Asia,75.25,20206.82098,Low
514,ETHIOPIA,2002,67946797.0,Africa,50.724998,530.053532,Medium


### Getting the unique values of a column

In [31]:
df.Country.unique()

array(['AFGHANISTAN', 'ALBANIA', 'ALGERIA', 'ANGOLA', 'ARGENTINA',
       'AUSTRALIA', 'AUSTRIA', 'BAHRAIN', 'BANGLADESH', 'BELGIUM',
       'BENIN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA', 'BRAZIL',
       'BULGARIA', 'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON',
       'CANADA', 'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA',
       'COLOMBIA', 'COMOROS', 'CONGO DEM. REP.', 'CONGO REP.',
       'COSTA RICA', "COTE D'IVOIRE", 'CROATIA', 'CUBA', 'CZECH REPUBLIC',
       'DENMARK', 'DJIBOUTI', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT',
       'EL SALVADOR', 'EQUATORIAL GUINEA', 'ERITREA', 'ETHIOPIA',
       'FINLAND', 'FRANCE', 'GABON', 'GAMBIA', 'GERMANY', 'GHANA',
       'GREECE', 'GUATEMALA', 'GUINEA', 'GUINEA-BISSAU', 'HAITI',
       'HONDURAS', 'HONG KONG CHINA', 'HUNGARY', 'ICELAND', 'INDIA',
       'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISRAEL', 'ITALY',
       'JAMAICA', 'JAPAN', 'JORDAN', 'KENYA', 'KOREA DEM. REP.',
       'KOREA REP.', 'KUWAIT', 'LEBANON',

### Counting occurrences of the unique values

In [32]:
df.pop_range.value_counts()

Low       1514
Medium     113
High        77
Name: pop_range, dtype: int64

In [33]:
df.pop_range.value_counts(normalize=True)

Low       0.888498
Medium    0.066315
High      0.045188
Name: pop_range, dtype: float64

### Setting a new DataFrame index from a column

In [34]:
df.set_index("Country", inplace=True)
df.head()

Unnamed: 0_level_0,year,pop,continent,lifeexp,GDP_per_cap,pop_range
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AFGHANISTAN,1952,8425333.0,Asia,28.801001,779.445314,Low
AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.85303,Low
AFGHANISTAN,1962,10267083.0,Asia,31.997,853.10071,Low
AFGHANISTAN,1967,11537966.0,Asia,34.02,836.197138,Low
AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low


### Removing/resetting the DataFrame index

In [35]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,AFGHANISTAN,1952,8425333.0,Asia,28.801001,779.445314,Low
1,AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.85303,Low
2,AFGHANISTAN,1962,10267083.0,Asia,31.997,853.10071,Low
3,AFGHANISTAN,1967,11537966.0,Asia,34.02,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low


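### Filtering data

Using the element-wise operators `&`, `|` and `~` (each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons)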

In [36]:
df[ (df.year>1957) & (df.lifeexp>40 )& ~(df.pop_range=='Low') & 
   ((df.continent=='Asia')|(df.continent=='Europe')) ]

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
98,BANGLADESH,1962,56839289.0,Asia,41.216000,686.341554,Medium
99,BANGLADESH,1967,62821884.0,Asia,43.452999,721.186086,Medium
100,BANGLADESH,1972,70759295.0,Asia,45.251999,630.233627,Medium
101,BANGLADESH,1977,80428306.0,Asia,46.923000,659.877232,Medium
102,BANGLADESH,1982,93074406.0,Asia,50.008999,676.981866,Medium
...,...,...,...,...,...,...,...
1651,VIETNAM,1987,62826491.0,Asia,62.820000,820.799445,Medium
1652,VIETNAM,1992,69940728.0,Asia,67.662003,989.023149,Medium
1653,VIETNAM,1997,76048996.0,Asia,70.671997,1385.896769,Medium
1654,VIETNAM,2002,80908147.0,Asia,73.016998,1764.456677,Medium


Using a query expression

In [37]:
df.query("year>1957 and lifeexp>40 and \
         not pop_range=='Low' and continent in ['Europe','Asia'] ")

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
98,BANGLADESH,1962,56839289.0,Asia,41.216000,686.341554,Medium
99,BANGLADESH,1967,62821884.0,Asia,43.452999,721.186086,Medium
100,BANGLADESH,1972,70759295.0,Asia,45.251999,630.233627,Medium
101,BANGLADESH,1977,80428306.0,Asia,46.923000,659.877232,Medium
102,BANGLADESH,1982,93074406.0,Asia,50.008999,676.981866,Medium
...,...,...,...,...,...,...,...
1651,VIETNAM,1987,62826491.0,Asia,62.820000,820.799445,Medium
1652,VIETNAM,1992,69940728.0,Asia,67.662003,989.023149,Medium
1653,VIETNAM,1997,76048996.0,Asia,70.671997,1385.896769,Medium
1654,VIETNAM,2002,80908147.0,Asia,73.016998,1764.456677,Medium
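
The equivalence of the two filtering styles can be checked on a small frame. The sketch below is an illustration only: the `sample` frame is hand-built from rows that appear in outputs of this notebook, and `isin` is used in place of the chained `|` comparisons.

```python
import pandas as pd

# Rows copied from outputs in this notebook; "sample" is an illustrative name
sample = pd.DataFrame({
    "Country": ["BANGLADESH", "ETHIOPIA", "POLAND", "JAPAN", "VIETNAM"],
    "year": [1962, 2002, 1977, 2007, 1987],
    "continent": ["Asia", "Africa", "Europe", "Asia", "Asia"],
    "lifeexp": [41.216000, 50.724998, 70.669998, 82.602997, 62.820000],
    "pop_range": ["Medium", "Medium", "Low", "High", "Medium"],
})

# Boolean mask: one parenthesized condition per criterion, combined with &
mask = (
    (sample["year"] > 1957)
    & (sample["lifeexp"] > 40)
    & (sample["pop_range"] != "Low")
    & sample["continent"].isin(["Asia", "Europe"])
)
by_mask = sample[mask]

# The same criteria written as a query string
by_query = sample.query(
    "year > 1957 and lifeexp > 40 and pop_range != 'Low' and continent in ['Europe', 'Asia']"
)

print(by_mask.equals(by_query))
```

Which style to prefer is largely a matter of readability; the mask form also works for column names that are not valid Python identifiers.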


### Grouping data

In [38]:
dfg = df.groupby('continent')

Counting the non-missing values of each column within each group

In [39]:
dfg.count()

Unnamed: 0_level_0,Country,year,pop,lifeexp,GDP_per_cap,pop_range
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Africa,624,624,624,624,624,624
Americas,300,300,300,300,300,300
Asia,396,396,396,396,396,396
Europe,360,360,360,360,360,360
Oceania,24,24,24,24,24,24


Getting the rows that belong to a given group

In [40]:
dfg.get_group('Oceania')

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
60,AUSTRALIA,1952,8691212.0,Oceania,69.120003,10039.59564,Low
61,AUSTRALIA,1957,9712569.0,Oceania,70.330002,10949.64959,Low
62,AUSTRALIA,1962,10794968.0,Oceania,70.93,12217.22686,Low
63,AUSTRALIA,1967,11872264.0,Oceania,71.099998,14526.12465,Low
64,AUSTRALIA,1972,13177000.0,Oceania,71.93,16788.62948,Low
65,AUSTRALIA,1977,14074100.0,Oceania,73.489998,18334.19751,Low
66,AUSTRALIA,1982,15184200.0,Oceania,74.739998,19477.00928,Low
67,AUSTRALIA,1987,16257249.0,Oceania,76.32,21888.88903,Low
68,AUSTRALIA,1992,17481977.0,Oceania,77.559998,23424.76683,Low
69,AUSTRALIA,1997,18565243.0,Oceania,78.830002,26997.93657,Low


Applying an aggregate function - variant 1

In [41]:
dfg["GDP_per_cap"].mean()

continent
Africa       2193.754578
Americas     7136.110356
Asia         7902.150428
Europe      14469.475533
Oceania     18621.609223
Name: GDP_per_cap, dtype: float64

Applying an aggregate function - variant 2

In [42]:
dfg["GDP_per_cap"].agg(["std", "mean"]) # function names as strings; newer pandas versions warn when NumPy functions are passed here

Unnamed: 0_level_0,std,mean
continent,Unnamed: 1_level_1,Unnamed: 2_level_1
Africa,2827.929863,2193.754578
Americas,6396.764112,7136.110356
Asia,14045.373112,7902.150428
Europe,9355.213498,14469.475533
Oceania,6358.983321,18621.609223


Applying an aggregate function - variant 3

In [43]:
dfg.agg({
    "GDP_per_cap" : [np.min, np.max],
    "lifeexp" : [np.mean, np.median]
})

Unnamed: 0_level_0,GDP_per_cap,GDP_per_cap,lifeexp,lifeexp
Unnamed: 0_level_1,amin,amax,mean,median
continent,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Africa,241.165876,21951.21176,48.86533,47.792
Americas,1201.637154,42951.65309,64.658737,67.048004
Asia,331.0,113523.1329,60.064903,61.7915
Europe,973.533195,49357.19017,71.903687,72.240997
Oceania,10039.59564,34435.36744,74.32621,73.664993


Getting the top row of each group according to the values of a given column

In [44]:
df.sort_values('lifeexp', ascending=False).groupby('continent').first()

Unnamed: 0_level_0,Country,year,pop,lifeexp,GDP_per_cap,pop_range
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Africa,REUNION,2007,798094.0,76.442001,7670.122558,Low
Americas,CANADA,2007,33390141.0,80.653,36319.23501,Low
Asia,JAPAN,2007,127467972.0,82.602997,31656.06806,High
Europe,ICELAND,2007,301931.0,81.757004,36180.78919,Low
Oceania,AUSTRALIA,2007,20434176.0,81.235001,34435.36744,Low
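
One caveat of the sort-then-`first()` pattern: `first()` returns the first non-missing value of each column separately, so when a column contains NaN the result row can mix values from different records. The sketch below shows this on a small hand-built frame and compares it with an `idxmax`-based variant. The rows are copied from outputs in this notebook, except that Japan's GDP is replaced with NaN purely for illustration (a hypothetical gap, not present in the real data).

```python
import numpy as np
import pandas as pd

# Rows copied from outputs in this notebook; Japan's GDP is set to NaN as a hypothetical gap
sample = pd.DataFrame({
    "Country": ["JAPAN", "HONG KONG CHINA", "ICELAND", "SWITZERLAND", "AUSTRALIA"],
    "continent": ["Asia", "Asia", "Europe", "Europe", "Oceania"],
    "lifeexp": [82.602997, 82.208000, 81.757004, 81.700996, 81.235001],
    "GDP_per_cap": [np.nan, 39724.97867, 36180.78919, 37506.41907, 34435.36744],
})

# Pattern used above: sort, group, take the first non-missing value per column
via_first = sample.sort_values("lifeexp", ascending=False).groupby("continent").first()

# Alternative: look up the whole row holding the maximum in each group
via_idxmax = sample.loc[sample.groupby("continent")["lifeexp"].idxmax()].set_index("continent")

print(via_first.loc["Asia", ["Country", "GDP_per_cap"]].tolist())
print(via_idxmax.loc["Asia", ["Country", "GDP_per_cap"]].tolist())
```

In the `first()` result the Asia row pairs Japan with Hong Kong's GDP, whereas the `idxmax` variant keeps Japan's row intact, including its missing value. On data without NaN, such as the dataset used here, both approaches select the same rows.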


Grouping by multiple columns

In [45]:
df.groupby(['continent','year']).mean(numeric_only=True) # numeric_only=True skips text columns, which cannot be averaged

Unnamed: 0_level_0,Unnamed: 1_level_0,pop,lifeexp,GDP_per_cap
continent,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Africa,1952,4570010.0,39.135498,1252.572466
Africa,1957,5093033.0,41.266346,1385.236062
Africa,1962,5702247.0,43.319443,1598.078825
Africa,1967,6447875.0,45.334538,2050.363801
Africa,1972,7305376.0,47.450943,2339.615674
Africa,1977,8328097.0,49.580421,2585.938508
Africa,1982,9602857.0,51.592865,2481.59296
Africa,1987,11054500.0,53.344788,2282.668991
Africa,1992,12674640.0,53.629578,2281.810333
Africa,1997,14304480.0,53.59827,2378.759555


### Deduplicating data

Finding duplicates

In [46]:
df.duplicated().sum() # number of duplicated rows

0

Finding duplicates based on selected columns

In [47]:
df.duplicated(subset='Country').sum()

1562

Show the duplicated records (by default, every occurrence after the first one is marked as a duplicate)

In [48]:
df[df.duplicated(subset='Country')]

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
1,AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.853030,Low
2,AFGHANISTAN,1962,10267083.0,Asia,31.997000,853.100710,Low
3,AFGHANISTAN,1967,11537966.0,Asia,34.020000,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low
5,AFGHANISTAN,1977,14880372.0,Asia,38.438000,786.113360,Low
...,...,...,...,...,...,...,...
1699,ZIMBABWE,1987,9216418.0,Africa,62.351002,706.157306,Low
1700,ZIMBABWE,1992,10704340.0,Africa,60.376999,693.420786,Low
1701,ZIMBABWE,1997,11404948.0,Africa,46.808998,792.449960,Low
1702,ZIMBABWE,2002,11926563.0,Africa,39.988998,672.038623,Low


Remove duplicates based on the selected columns, keeping only the last occurrence

In [49]:
df.drop_duplicates(subset='Country', keep='last') # the keep parameter selects which occurrence is kept

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
11,AFGHANISTAN,2007,31889923.0,Asia,43.827999,974.580338,Low
23,ALBANIA,2007,3600523.0,Europe,76.422997,5937.029526,Low
35,ALGERIA,2007,33333216.0,Africa,72.301003,6223.367465,Low
47,ANGOLA,2007,12420476.0,Africa,42.730999,4797.231267,Low
59,ARGENTINA,2007,40301927.0,Americas,75.320000,12779.379640,Low
...,...,...,...,...,...,...,...
1655,VIETNAM,2007,85262356.0,Asia,74.249001,2441.576404,Medium
1667,WEST BANK AND GAZA,2007,4018332.0,Asia,73.421997,3025.349798,Low
1679,YEMEN REP.,2007,22211743.0,Asia,62.698002,2280.769906,Low
1691,ZAMBIA,2007,11746035.0,Africa,42.383999,1271.211593,Low
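
The effect of the `keep` parameter can be compared side by side. The following is a small sketch on a hand-built frame (rows copied from outputs in this notebook; `sample` is an illustrative name) showing the three accepted values: `'first'`, `'last'` and `False`.

```python
import pandas as pd

# Rows copied from outputs in this notebook
sample = pd.DataFrame({
    "Country": ["AFGHANISTAN", "AFGHANISTAN", "ALBANIA"],
    "year": [2002, 2007, 2007],
})

keep_first = sample.drop_duplicates(subset="Country", keep="first")  # earliest row per country
keep_last = sample.drop_duplicates(subset="Country", keep="last")    # latest row per country
keep_none = sample.drop_duplicates(subset="Country", keep=False)     # drop every repeated country

print(list(keep_first["year"]), list(keep_last["year"]), list(keep_none["Country"]))
```

Because the result depends on row order, it is usually worth sorting explicitly (for example by `year`) before relying on `keep='first'` or `keep='last'`.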


### Building a ranking

In [50]:
df_tmp = df.query("year==2007").copy() # copy, so that the new column is added to an independent DataFrame
df_tmp["ranking"] = df_tmp["lifeexp"].rank(ascending=False) # ascending=False - the largest value gets rank 1
df_tmp.sort_values('ranking')

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range,ranking
803,JAPAN,2007,127467972.0,Asia,82.602997,31656.068060,High,1.0
671,HONG KONG CHINA,2007,6980412.0,Asia,82.208000,39724.978670,Low,2.0
695,ICELAND,2007,301931.0,Europe,81.757004,36180.789190,Low,3.0
1487,SWITZERLAND,2007,7554661.0,Europe,81.700996,37506.419070,Low,4.0
71,AUSTRALIA,2007,20434176.0,Oceania,81.235001,34435.367440,Low,5.0
...,...,...,...,...,...,...,...,...
887,LESOTHO,2007,2012649.0,Africa,42.591999,1569.331442,Low,138.0
1355,SIERRA LEONE,2007,6144562.0,Africa,42.568001,862.540756,Low,139.0
1691,ZAMBIA,2007,11746035.0,Africa,42.383999,1271.211593,Low,140.0
1043,MOZAMBIQUE,2007,19951656.0,Africa,42.082001,823.685621,Low,141.0
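
The ranking above contains no ties, so the tie-breaking rule does not show. As a sketch of how `rank` treats equal values, the example below uses made-up numbers (not taken from the dataset) and compares the default `method='average'` with `'min'` and `'dense'`.

```python
import pandas as pd

# Made-up values with one tie, for illustration only
scores = pd.Series([80.0, 75.0, 80.0, 70.0])

avg_rank = scores.rank(ascending=False)                    # default: tied values share the mean of their ranks
min_rank = scores.rank(ascending=False, method="min")      # tied values share the lowest rank, next rank is skipped
dense_rank = scores.rank(ascending=False, method="dense")  # tied values share a rank, no gaps afterwards

print(avg_rank.tolist(), min_rank.tolist(), dense_rank.tolist())
```

The default explains why the `ranking` column is of float type: averaged ranks such as 1.5 are possible.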


### Handling missing values

In [51]:
df_tmp = df.sample(n=15, random_state=42, ignore_index=True)

np.random.seed(1) # set the pseudorandom seed

arr = np.random.randint(0, len(df_tmp), 5) # draw 5 pseudorandom row labels (repeats are possible)
df_tmp.loc[ arr , "GDP_per_cap" ] = np.nan # and set GDP_per_cap to NaN in those rows

arr = np.random.randint(0, len(df_tmp), 5) # draw 5 pseudorandom row labels (repeats are possible)
df_tmp.loc[ arr , "pop" ] = np.nan # and set pop to NaN in those rows

df_tmp[-1:] = (np.nan,)*7 # make the last row consist only of NaN
df_tmp

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,MYANMAR,1962.0,,Asia,45.108002,388.0,Low
1,IRELAND,1957.0,,Europe,68.900002,5599.077872,Low
2,JAMAICA,1977.0,2156814.0,Americas,70.110001,6650.195573,Low
3,COTE D'IVOIRE,1987.0,10761098.0,Africa,54.654999,2156.956069,Low
4,MOROCCO,1997.0,28529501.0,Africa,67.660004,2982.101858,Low
5,VIETNAM,1972.0,,Asia,50.254002,,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,2840009.0,Africa,50.485001,844.87635,Low
7,TAIWAN,1997.0,21628605.0,Asia,75.25,20206.82098,Low
8,ETHIOPIA,2002.0,67946797.0,Africa,50.724998,,Medium
9,POLAND,1977.0,34621254.0,Europe,70.669998,,Low


Counting the NaN values in each column of the DataFrame

In [52]:
df_tmp.isna().sum()

Country        1
year           1
pop            5
continent      1
lifeexp        1
GDP_per_cap    6
pop_range      1
dtype: int64

Dropping rows that contain at least one NaN value

In [53]:
df_tmp.dropna(axis=0, how='any')

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
2,JAMAICA,1977.0,2156814.0,Americas,70.110001,6650.195573,Low
3,COTE D'IVOIRE,1987.0,10761098.0,Africa,54.654999,2156.956069,Low
4,MOROCCO,1997.0,28529501.0,Africa,67.660004,2982.101858,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,2840009.0,Africa,50.485001,844.87635,Low
7,TAIWAN,1997.0,21628605.0,Asia,75.25,20206.82098,Low
10,ALBANIA,2007.0,3600523.0,Europe,76.422997,5937.029526,Low
13,MADAGASCAR,1962.0,5703324.0,Africa,40.848,1643.38711,Low


Dropping rows in which all values are NaN

In [54]:
df_tmp.dropna(axis=0, how='all')

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,MYANMAR,1962.0,,Asia,45.108002,388.0,Low
1,IRELAND,1957.0,,Europe,68.900002,5599.077872,Low
2,JAMAICA,1977.0,2156814.0,Americas,70.110001,6650.195573,Low
3,COTE D'IVOIRE,1987.0,10761098.0,Africa,54.654999,2156.956069,Low
4,MOROCCO,1997.0,28529501.0,Africa,67.660004,2982.101858,Low
5,VIETNAM,1972.0,,Asia,50.254002,,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,2840009.0,Africa,50.485001,844.87635,Low
7,TAIWAN,1997.0,21628605.0,Asia,75.25,20206.82098,Low
8,ETHIOPIA,2002.0,67946797.0,Africa,50.724998,,Medium
9,POLAND,1977.0,34621254.0,Europe,70.669998,,Low


Dropping columns that contain at least one NaN value

In [55]:
df_tmp = df_tmp.iloc[:-1,].copy() # leave out the all-NaN last row; copy so that later assignments are safe
df_tmp.dropna(axis=1, how='any')

Unnamed: 0,Country,year,continent,lifeexp,pop_range
0,MYANMAR,1962.0,Asia,45.108002,Low
1,IRELAND,1957.0,Europe,68.900002,Low
2,JAMAICA,1977.0,Americas,70.110001,Low
3,COTE D'IVOIRE,1987.0,Africa,54.654999,Low
4,MOROCCO,1997.0,Africa,67.660004,Low
5,VIETNAM,1972.0,Asia,50.254002,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,Africa,50.485001,Low
7,TAIWAN,1997.0,Asia,75.25,Low
8,ETHIOPIA,2002.0,Africa,50.724998,Medium
9,POLAND,1977.0,Europe,70.669998,Low


Replacing NaN values in the "pop" column with zero

In [56]:
df_tmp["pop"] = df_tmp["pop"].fillna(0)
df_tmp

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,MYANMAR,1962.0,0.0,Asia,45.108002,388.0,Low
1,IRELAND,1957.0,0.0,Europe,68.900002,5599.077872,Low
2,JAMAICA,1977.0,2156814.0,Americas,70.110001,6650.195573,Low
3,COTE D'IVOIRE,1987.0,10761098.0,Africa,54.654999,2156.956069,Low
4,MOROCCO,1997.0,28529501.0,Africa,67.660004,2982.101858,Low
5,VIETNAM,1972.0,0.0,Asia,50.254002,,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,2840009.0,Africa,50.485001,844.87635,Low
7,TAIWAN,1997.0,21628605.0,Asia,75.25,20206.82098,Low
8,ETHIOPIA,2002.0,67946797.0,Africa,50.724998,,Medium
9,POLAND,1977.0,34621254.0,Europe,70.669998,,Low


Replacing NaN values in the "GDP_per_cap" column with the last preceding non-NaN value

In [57]:
df_tmp["GDP_per_cap"] = df_tmp["GDP_per_cap"].ffill() # forward fill; fillna(method='ffill') is deprecated in newer pandas
df_tmp

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,MYANMAR,1962.0,0.0,Asia,45.108002,388.0,Low
1,IRELAND,1957.0,0.0,Europe,68.900002,5599.077872,Low
2,JAMAICA,1977.0,2156814.0,Americas,70.110001,6650.195573,Low
3,COTE D'IVOIRE,1987.0,10761098.0,Africa,54.654999,2156.956069,Low
4,MOROCCO,1997.0,28529501.0,Africa,67.660004,2982.101858,Low
5,VIETNAM,1972.0,0.0,Asia,50.254002,2982.101858,Low
6,CENTRAL AFRICAN REPUBLIC,1987.0,2840009.0,Africa,50.485001,844.87635,Low
7,TAIWAN,1997.0,21628605.0,Asia,75.25,20206.82098,Low
8,ETHIOPIA,2002.0,67946797.0,Africa,50.724998,20206.82098,Medium
9,POLAND,1977.0,34621254.0,Europe,70.669998,20206.82098,Low


### Pivoting a DataFrame

In [58]:
pd.pivot_table(data=df, index="continent", columns="year", values="GDP_per_cap")

year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Africa,1252.572466,1385.236062,1598.078825,2050.363801,2339.615674,2585.938508,2481.59296,2282.668991,2281.810333,2378.759555,2599.385159,3089.032605
Americas,4079.062552,4616.043733,4901.54187,5668.253496,6491.334139,7352.007126,7506.737088,7793.400261,8044.934406,8889.300863,9287.677107,11003.031625
Asia,5195.484004,5787.73294,5729.369625,5971.173374,8187.468699,7791.31402,7434.135157,7608.226508,8639.690248,9834.093295,10174.090397,12473.02687
Europe,5661.057435,6963.012816,8365.486814,10143.823757,12479.575246,14283.97911,15617.896551,17214.310727,17061.568084,19076.781802,21711.732422,25054.481636
Oceania,10298.08565,11598.522455,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
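
With its default aggregation (the mean), a pivot table like the one above can also be thought of as a `groupby` followed by `unstack`. The sketch below checks this on a small hand-built frame; the rows are copied from outputs in this notebook, and `sample` is an illustrative name.

```python
import pandas as pd

# Rows copied from outputs in this notebook
sample = pd.DataFrame({
    "Country": ["AFGHANISTAN", "AFGHANISTAN", "VIETNAM", "VIETNAM", "ZIMBABWE", "ZIMBABWE"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Africa", "Africa"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
    "GDP_per_cap": [726.734055, 974.580338, 1764.456677, 2441.576404, 672.038623, 469.709298],
})

# Pivot table: one row per continent, one column per year, mean GDP in each cell
pivoted = pd.pivot_table(data=sample, index="continent", columns="year", values="GDP_per_cap")

# Same result built step by step: group, aggregate, then move "year" from the index to the columns
unstacked = sample.groupby(["continent", "year"])["GDP_per_cap"].mean().unstack("year")

pd.testing.assert_frame_equal(pivoted, unstacked)
print(pivoted)
```

Seeing the two forms side by side can help when a pivot table needs a custom aggregation step that is easier to express with `groupby`.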


In [59]:
pd.pivot_table(data=df, index="continent", columns="year", aggfunc={
    "GDP_per_cap" : [np.median, np.max]
})

Unnamed: 0_level_0,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap,GDP_per_cap
Unnamed: 0_level_1,amax,amax,amax,amax,amax,amax,amax,amax,amax,amax,...,median,median,median,median,median,median,median,median,median,median
year,1952,1957,1962,1967,1972,1977,1982,1987,1992,1997,...,1962,1967,1972,1977,1982,1987,1992,1997,2002,2007
continent,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
Africa,4725.295531,5487.104219,6757.030816,18772.75169,21011.49721,21951.21176,17364.27538,11864.40844,13522.15752,14722.84188,...,1133.783678,1210.376379,1443.372508,1399.638836,1323.728306,1219.585563,1161.631357,1179.883114,1215.683217,1452.267078
Americas,13990.48208,14847.12712,16173.14586,19530.36557,21806.03594,24072.63213,25009.55914,29884.35041,32003.93224,35767.43303,...,4086.114078,4643.393534,5305.445256,6281.290855,6434.501797,6360.943444,6618.74305,7113.692252,6994.774861,8948.102923
Asia,108382.3529,113523.1329,95458.11176,80894.88326,109347.867,59265.47714,33693.17525,28118.42998,34932.91959,40300.61996,...,1649.552153,2029.228142,2571.423014,3195.484582,4106.525293,4106.492315,3726.063507,3645.379572,4090.925331,4471.061906
Europe,14734.23275,17909.48973,20431.0927,22966.14432,27195.11304,26982.29052,28397.71512,31540.9748,33965.66115,41283.16433,...,7515.733737,9366.067033,12326.37999,14225.754515,15322.82472,16215.485895,17550.155945,19596.49855,23674.86323,28054.06579
Oceania,10556.57566,12247.39532,13175.678,14526.12465,16788.62948,18334.19751,19477.00928,21888.88903,23424.76683,26997.93657,...,12696.45243,14495.02179,16417.33338,17283.957605,18554.70984,20448.04016,20894.045885,24024.17517,26938.77804,29810.188275
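
The three-level column header produced above can be awkward to work with. The sketch below is an illustration only: the `toy` frame and its values are made up, and the column name `gdp` is hypothetical. It passes the aggregations as the strings `"median"` and `"max"` (so the level is labelled `max` instead of `amax`) and then flattens the column MultiIndex into plain strings.

```python
import pandas as pd

# Made-up data, not taken from the dataset used in this notebook
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "Y", "Y"],
    "year": [1952, 1952, 1957, 1952, 1957],
    "gdp": [1.0, 3.0, 5.0, 10.0, 20.0],
})

# Aggregations given by name; columns get the levels (value, function, year)
table = pd.pivot_table(
    data=toy,
    index="continent",
    columns="year",
    aggfunc={"gdp": ["median", "max"]},
)

# Flatten the column MultiIndex into names such as "gdp_max_1952"
flat = table.copy()
flat.columns = ["_".join(str(part) for part in col) for col in table.columns]
print(flat)
```

A single column of the unflattened table can still be selected with a tuple, for example `table[("gdp", "max", 1952)]`.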


### Data shifting

In [60]:
df_tmp = df.query("Country=='POLAND'").sort_values('year')
df_tmp

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
1224,POLAND,1952,25730551.0,Europe,61.310001,4029.329699,Low
1225,POLAND,1957,28235346.0,Europe,65.769997,4734.253019,Low
1226,POLAND,1962,30329617.0,Europe,67.639999,5338.752143,Low
1227,POLAND,1967,31785378.0,Europe,69.610001,6557.152776,Low
1228,POLAND,1972,33039545.0,Europe,70.849998,8006.506993,Low
1229,POLAND,1977,34621254.0,Europe,70.669998,9508.141454,Low
1230,POLAND,1982,36227381.0,Europe,71.32,8451.531004,Low
1231,POLAND,1987,37740710.0,Europe,70.980003,9082.351172,Low
1232,POLAND,1992,38370697.0,Europe,70.989998,7738.881247,Low
1233,POLAND,1997,38654957.0,Europe,72.75,10159.58368,Low


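Comparing the population (the 'pop' column) between consecutive observations. The data is recorded every five years, so `shift(1)` refers to the value from five years earlier, and the first row has no predecessor (NaN).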

<a href='https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html'>https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html</a>

In [61]:
df_tmp["pop_change"] = df_tmp["pop"] - df_tmp["pop"].shift(1)
df_tmp

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range,pop_change
1224,POLAND,1952,25730551.0,Europe,61.310001,4029.329699,Low,
1225,POLAND,1957,28235346.0,Europe,65.769997,4734.253019,Low,2504795.0
1226,POLAND,1962,30329617.0,Europe,67.639999,5338.752143,Low,2094271.0
1227,POLAND,1967,31785378.0,Europe,69.610001,6557.152776,Low,1455761.0
1228,POLAND,1972,33039545.0,Europe,70.849998,8006.506993,Low,1254167.0
1229,POLAND,1977,34621254.0,Europe,70.669998,9508.141454,Low,1581709.0
1230,POLAND,1982,36227381.0,Europe,71.32,8451.531004,Low,1606127.0
1231,POLAND,1987,37740710.0,Europe,70.980003,9082.351172,Low,1513329.0
1232,POLAND,1992,38370697.0,Europe,70.989998,7738.881247,Low,629987.0
1233,POLAND,1997,38654957.0,Europe,72.75,10159.58368,Low,284260.0
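
The example above works on a single country. A minimal sketch of the same idea applied to several countries at once is shown below, assuming the shift should restart for every country. The `toy` frame and its values are made up for illustration. `diff()` is shown as a shorthand for "value minus shifted value".

```python
import pandas as pd

# Made-up data, not taken from the dataset used in this notebook
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
    "pop": [100.0, 110.0, 125.0, 50.0, 40.0, 60.0],
})
toy = toy.sort_values(["country", "year"])

# Shift within each country, so the first row of every country gets NaN
toy["pop_prev"] = toy.groupby("country")["pop"].shift(1)
toy["pop_change"] = toy["pop"] - toy["pop_prev"]

# Equivalent shorthand: difference to the previous row within each group
toy["pop_change_diff"] = toy.groupby("country")["pop"].diff()
print(toy)
```

Without the `groupby`, the first row of country B would be compared with the last row of country A, which is rarely what is wanted.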


### Concatenating DataFrame objects

Row-wise concatenation

In [64]:
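df1 = df.iloc[:10]  # take the first 10 rows
df2 = df.iloc[-10:] # take the last 10 rows
# Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0;
# on newer versions use pd.concat(), as in the next cell
df1.append(df2, ignore_index=True)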

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,AFGHANISTAN,1952,8425333.0,Asia,28.801001,779.445314,Low
1,AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.85303,Low
2,AFGHANISTAN,1962,10267083.0,Asia,31.997,853.10071,Low
3,AFGHANISTAN,1967,11537966.0,Asia,34.02,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low
5,AFGHANISTAN,1977,14880372.0,Asia,38.438,786.11336,Low
6,AFGHANISTAN,1982,12881816.0,Asia,39.854,978.011439,Low
7,AFGHANISTAN,1987,13867957.0,Asia,40.821999,852.395945,Low
8,AFGHANISTAN,1992,16317921.0,Asia,41.674,649.341395,Low
9,AFGHANISTAN,1997,22227415.0,Asia,41.763,635.341351,Low


Row-wise concatenation using pd.concat()

In [65]:
pd.concat([df1, df2], axis=0, ignore_index=True)

Unnamed: 0,Country,year,pop,continent,lifeexp,GDP_per_cap,pop_range
0,AFGHANISTAN,1952,8425333.0,Asia,28.801001,779.445314,Low
1,AFGHANISTAN,1957,9240934.0,Asia,30.332001,820.85303,Low
2,AFGHANISTAN,1962,10267083.0,Asia,31.997,853.10071,Low
3,AFGHANISTAN,1967,11537966.0,Asia,34.02,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,Asia,36.088001,739.981106,Low
5,AFGHANISTAN,1977,14880372.0,Asia,38.438,786.11336,Low
6,AFGHANISTAN,1982,12881816.0,Asia,39.854,978.011439,Low
7,AFGHANISTAN,1987,13867957.0,Asia,40.821999,852.395945,Low
8,AFGHANISTAN,1992,16317921.0,Asia,41.674,649.341395,Low
9,AFGHANISTAN,1997,22227415.0,Asia,41.763,635.341351,Low


Column-wise concatenation using pd.concat()

In [69]:
df1 = df.iloc[:10, :3] # take the first 10 rows and the first 3 columns
df2 = df.iloc[:10, -3:] # take the first 10 rows and the last 3 columns
pd.concat([df1, df2], axis=1) # rows are aligned on their index values

Unnamed: 0,Country,year,pop,lifeexp,GDP_per_cap,pop_range
0,AFGHANISTAN,1952,8425333.0,28.801001,779.445314,Low
1,AFGHANISTAN,1957,9240934.0,30.332001,820.85303,Low
2,AFGHANISTAN,1962,10267083.0,31.997,853.10071,Low
3,AFGHANISTAN,1967,11537966.0,34.02,836.197138,Low
4,AFGHANISTAN,1972,13079460.0,36.088001,739.981106,Low
5,AFGHANISTAN,1977,14880372.0,38.438,786.11336,Low
6,AFGHANISTAN,1982,12881816.0,39.854,978.011439,Low
7,AFGHANISTAN,1987,13867957.0,40.821999,852.395945,Low
8,AFGHANISTAN,1992,16317921.0,41.674,649.341395,Low
9,AFGHANISTAN,1997,22227415.0,41.763,635.341351,Low
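
Because `pd.concat(..., axis=1)` aligns rows on index labels rather than on position, frames with different indexes behave differently from the example above. The sketch below illustrates this on two small made-up frames (`left` and `right` are hypothetical names).

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])

# Default (outer) alignment: union of index labels, gaps become NaN
outer = pd.concat([left, right], axis=1)

# Keep only the index labels present in both frames
inner = pd.concat([left, right], axis=1, join="inner")

# To glue frames together by position, reset both indexes first
positional = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

print(outer)
print(inner)
print(positional)
```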


### Merging DataFrame objects

In [90]:
df1 = pd.DataFrame({
    "ID1" : range(1,14,2),
    "value1" : ['A','B','C','D','E','F','G']
})
df1

Unnamed: 0,ID1,value1
0,1,A
1,3,B
2,5,C
3,7,D
4,9,E
5,11,F
6,13,G


In [91]:
df2 = pd.DataFrame({
    "ID2" : range(1,8),
    "value2" : ['A','B','C','D','E','F','G']
})
df2

Unnamed: 0,ID2,value2
0,1,A
1,2,B
2,3,C
3,4,D
4,5,E
5,6,F
6,7,G


Intersection of sets (inner join)

In [97]:
pd.merge(df1, df2, left_on='ID1', right_on='ID2', how='inner')

Unnamed: 0,ID1,value1,ID2,value2
0,1,A,1,A
1,3,B,3,C
2,5,C,5,E
3,7,D,7,G


Union of sets (outer join)

In [98]:
pd.merge(df1, df2, left_on='ID1', right_on='ID2', how='outer')

Unnamed: 0,ID1,value1,ID2,value2
0,1.0,A,1.0,A
1,3.0,B,3.0,C
2,5.0,C,5.0,E
3,7.0,D,7.0,G
4,9.0,E,,
5,11.0,F,,
6,13.0,G,,
7,,,2.0,B
8,,,4.0,D
9,,,6.0,F


Left join

In [95]:
pd.merge(df1, df2, left_on='ID1', right_on='ID2', how='left', indicator=True) # left join

Unnamed: 0,ID1,value1,ID2,value2,_merge
0,1,A,1.0,A,both
1,3,B,3.0,C,both
2,5,C,5.0,E,both
3,7,D,7.0,G,both
4,9,E,,,left_only
5,11,F,,,left_only
6,13,G,,,left_only


Right join

In [96]:
pd.merge(df1, df2, left_on='ID1', right_on='ID2', how='right', indicator=True) # right join

Unnamed: 0,ID1,value1,ID2,value2,_merge
0,1.0,A,1,A,both
1,,,2,B,right_only
2,3.0,B,3,C,both
3,,,4,D,right_only
4,5.0,C,5,E,both
5,,,6,F,right_only
6,7.0,D,7,G,both
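
The `_merge` column added by `indicator=True` can also be used for filtering. As a minimal sketch, the rows of `df1` that have no counterpart in `df2` (an "anti-join") can be selected as follows. The frames are rebuilt here so that the example is self-contained.

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": range(1, 14, 2), "value1": list("ABCDEFG")})
df2 = pd.DataFrame({"ID2": range(1, 8), "value2": list("ABCDEFG")})

# Left join with the indicator column, then keep only unmatched left rows
merged = pd.merge(df1, df2, left_on="ID1", right_on="ID2",
                  how="left", indicator=True)
left_only = merged.loc[merged["_merge"] == "left_only", ["ID1", "value1"]]
print(left_only)
```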


### Saving DataFrame data to a flat file

In [99]:
df.to_csv("filename.csv", index_label="MyIndex", sep=";", decimal=",")
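
A file written with a non-default separator and decimal mark has to be read back with the same settings. The sketch below shows the round trip on a small made-up frame. It writes to an in-memory string instead of `filename.csv` to keep the example free of side effects.

```python
import io

import pandas as pd

# Made-up data, not taken from the dataset used in this notebook
toy = pd.DataFrame({"country": ["A", "B"], "gdp": [1234.5, 678.25]})

# With no path given, to_csv() returns the CSV content as a string
csv_text = toy.to_csv(index_label="MyIndex", sep=";", decimal=",")
print(csv_text)

# Read it back using the same separator and decimal mark
restored = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",",
                       index_col="MyIndex")
print(restored.dtypes)
```

If `decimal=","` is omitted when reading, values such as `1234,5` are loaded as text rather than as floating-point numbers.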