# Pandas
Pandas is a Python library for loading, analyzing, cleaning, processing, and visualizing data.

In [1]:
import pandas as pd
import numpy as np

## Pandas Series
- A pandas Series is a one-dimensional array in which every value has an index label.

Let's start by analyzing the populations of the G7 countries: Canada, France, Germany, Italy, Japan, the United Kingdom, and the United States. We will use a pandas.Series for this.

In [2]:
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

- A Series can have a name.

In [3]:
g7_pop.name = "G7 Population in millions"
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

- We can check the data type of a Series.

In [4]:
g7_pop.dtype

dtype('float64')

- We can get the values of a Series as a NumPy array.

In [5]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])
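For new code, the pandas documentation suggests to_numpy() as an alternative to .values. A minimal sketch, using a shortened copy of the data above, of what it returns:

```python
import numpy as np
import pandas as pd

# Shortened copy of the population values used above.
g7_pop = pd.Series([35.467, 63.951, 80.940])

# to_numpy() returns a plain NumPy array; the index labels are dropped.
arr = g7_pop.to_numpy()

print(type(arr), arr.dtype, arr.shape)
```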

- A Series looks like a Python list or a NumPy array, but it behaves more like a Python dict, because every value is tied to an index. When we do not provide one, pandas assigns an index automatically (0, 1, 2, ...).

In [6]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [7]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

- To make the Series read more like a table, we can replace this index with labels.

In [8]:
g7_pop.index=["Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States"]

In [9]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
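The values, labels, and name can also be supplied in a single step when the Series is created. The sketch below reuses the data from above; the variable from_dict is only an illustrative name:

```python
import pandas as pd

# Build the Series with its index and name in one call.
g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 Population in millions",
)

# A dict works too: keys become the index, values become the data.
from_dict = pd.Series({"Canada": 35.467, "France": 63.951})

print(g7_pop.index.tolist())
print(from_dict["France"])
```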

## Indexing
- Indexing works much like it does for lists and dicts: we use the index label of the value we are looking for.

In [10]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [11]:
g7_pop["Canada"]

35.467

In [12]:
g7_pop.Japan

127.061
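Attribute access such as g7_pop.Japan only works when the label is a valid Python identifier and does not clash with an existing Series attribute. A minimal sketch, on a shortened copy of the data, of why bracket or .loc access is the safer default:

```python
import pandas as pd

# Shortened copy of the labelled Series used above.
g7_pop = pd.Series(
    [127.061, 64.511],
    index=["Japan", "United Kingdom"],
)

# "Japan" is a valid identifier, so both spellings return the same value.
same = g7_pop.Japan == g7_pop["Japan"]

# "United Kingdom" contains a space, so only bracket or .loc access can reach it.
uk = g7_pop["United Kingdom"]
uk_loc = g7_pop.loc["United Kingdom"]

print(same, uk, uk_loc)
```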

- The .iloc indexer can also be used to reach the first, last, or any other value of the Series by position.

In [13]:
g7_pop.iloc[0]

35.467

In [14]:
g7_pop.iloc[-1]

318.523

- We can select multiple elements at once.

In [15]:
g7_pop[["Italy", "France"]]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

In [16]:
g7_pop.iloc[[0,1]]

Canada    35.467
France    63.951
Name: G7 Population in millions, dtype: float64

- We can also do this with slicing. Unlike Python lists, a label-based slice in pandas includes the upper bound.

In [17]:
g7_pop["Canada": "Italy"]

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
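The difference between label-based and position-based slicing is easy to check side by side. A small sketch on a shortened copy of the Series (the variable names are illustrative):

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=["Canada", "France", "Germany", "Italy"],
)

# Label slice: the end label "Germany" is included.
by_label = g7_pop.loc["Canada":"Germany"]

# Positional slice: position 2 is excluded, as with Python lists.
by_position = g7_pop.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```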

## Conditional selection (boolean arrays)
- The same filtering used in NumPy also works on a pandas.Series.

In [18]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

g7_pop[g7_pop > 70]

- Let's find the mean of this Series.

In [19]:
g7_pop.mean()

107.30257142857144

- To find the countries above the mean:

In [20]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [21]:
g7_pop[ (g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2) ]

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
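Conditions are combined with & (and), | (or), and ~ (not), and each condition needs its own parentheses. Note that in In [21] the second condition is already implied by the first, so the result equals the first condition alone. The sketch below, with illustrative variable names, shows one way to split the countries into those inside and outside the band mean ± std/2:

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

mean = g7_pop.mean()
half_std = g7_pop.std() / 2

# & (and): values inside the band mean ± std/2.
inside = g7_pop[(g7_pop > mean - half_std) & (g7_pop < mean + half_std)]

# | (or): values below the band or above it.
outside = g7_pop[(g7_pop < mean - half_std) | (g7_pop > mean + half_std)]

# ~ (not): everything that is NOT above the mean.
not_above_mean = g7_pop[~(g7_pop > mean)]

print(inside.index.tolist())
print(outside.index.tolist())
print(not_above_mean.index.tolist())
```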

## Operations and Methods
- Like NumPy arrays, Series also support vectorized operations and methods.

In [22]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [23]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [24]:
g7_pop.mean()


107.30257142857144

In [25]:
g7_pop[ (g7_pop > 80) | (g7_pop < 200)]

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
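Vectorized operations apply to every element and keep the index labels. A short sketch on a shortened copy of the data, showing a scalar operation, a NumPy function, and an operation between two values derived from the same Series:

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=["Canada", "France", "Germany"],
)

people = g7_pop * 1_000_000      # the scalar is applied to every element
logs = np.log(g7_pop)            # NumPy functions return a Series with the same index
share = g7_pop / g7_pop.sum()    # each country's share of the total

print(people)
print(logs)
print(share)
```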

## Modifying a Series

In [26]:
g7_pop["Canada"] = 40.5
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [27]:
g7_pop.iloc[-1] = 500
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [28]:
g7_pop[g7_pop < 70] = 99.99
g7_pop

Canada             99.990
France             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: G7 Population in millions, dtype: float64
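These assignments change the Series in place, so the original values are lost. If they are still needed, one option is to work on a copy. A minimal sketch (original and working are illustrative names):

```python
import pandas as pd

original = pd.Series(
    [35.467, 63.951, 80.940],
    index=["Canada", "France", "Germany"],
)

# copy() creates an independent Series; edits to it do not touch the original.
working = original.copy()
working[working < 70] = 99.99

print(original)
print(working)
```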

## Pandas DataFrames - Data Tables
- A pandas DataFrame is a two-dimensional table made up of rows and columns.

Let's continue analyzing the G7 countries, this time with a DataFrame. DataFrames look a lot like tables.

In [29]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])
df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| 0 | 35.467 | 1785387 | 9984670 | 0.913 | America |
| 1 | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| 2 | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| 3 | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| 4 | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| 5 | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| 6 | 318.523 | 17348075 | 9525067 | 0.915 | America |


- DataFrames also have an index.

In [32]:
df.index = ["Canada",
            "France",
            "Germany",
            "Italy",
            "Japan",
            "United Kingdom",
            "United States"]

df

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |
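The index can also be passed when the DataFrame is created, instead of being assigned afterwards. A minimal sketch with only two of the columns and three of the countries:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940],
        "Continent": ["America", "Europe", "Europe"],
    },
    index=["Canada", "France", "Germany"],  # row labels set at construction time
)

print(df)
print(df.loc["France", "Population"])
```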


In [33]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [34]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

- For a quick overview of a DataFrame, we can use the .info() method.

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 280.0+ bytes


In [37]:
df.size

35

In [38]:
df.shape

(7, 5)
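The two attributes are related: size is the number of rows times the number of columns reported by shape. A small sketch on a two-row example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951],
        "GDP": [1785387, 2833687],
        "HDI": [0.913, 0.888],
    },
    index=["Canada", "France"],
)

n_rows, n_cols = df.shape   # (rows, columns)

print(n_rows, n_cols, df.size, len(df))
```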

- For a short statistical summary of a DataFrame, we use the .describe() method. By default it covers only the numeric columns.

In [40]:
df.describe()

| | Population | GDP | Surface Area | HDI |
|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 107.302571 | 5080248.0 | 3061327.0 | 0.900429 |
| std | 97.24997 | 5494020.0 | 4576187.0 | 0.016592 |
| min | 35.467 | 1785387.0 | 242495.0 | 0.873 |
| 25% | 62.308 | 2500716.0 | 329225.0 | 0.8895 |
| 50% | 64.511 | 2950039.0 | 377930.0 | 0.907 |
| 75% | 104.0005 | 4238402.0 | 5082873.0 | 0.914 |
| max | 318.523 | 17348080.0 | 9984670.0 | 0.916 |
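The Continent column does not appear in this summary because it holds text. One way to summarize such a column is to call describe() or value_counts() on it directly, as in this sketch (summary and counts are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {"Continent": ["America", "Europe", "Europe", "Europe",
                   "Asia", "Europe", "America"]},
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# For a text column, describe() reports count, unique, top, and freq.
summary = df["Continent"].describe()

# value_counts() lists how often each value occurs.
counts = df["Continent"].value_counts()

print(summary)
print(counts)
```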


In [41]:
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [42]:
df.dtypes.value_counts()

float64    2
int64      2
object     1
dtype: int64

### Indexing, Selection and Slicing

- Individual columns of a DataFrame can be selected with regular indexing. Each column is a Series.

In [43]:
df["Population"]

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

In [44]:
df.Population

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

- .loc and .iloc select rows. The difference is that .loc uses the index label, while .iloc uses the integer position.

In [48]:
df.loc["Canada"]

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

In [49]:
df.iloc[-1]

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object

In [50]:
type(df.iloc[-1])

pandas.core.series.Series

- With the to_frame() method we can convert a Series back into a DataFrame.

df["Population"].to_frame()
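Selecting with a one-element list of column names gives the same one-column DataFrame as to_frame(). A minimal sketch on a two-row example:

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [35.467, 63.951], "GDP": [1785387, 2833687]},
    index=["Canada", "France"],
)

as_series = df["Population"]           # single brackets: a Series
via_to_frame = as_series.to_frame()    # Series converted to a one-column DataFrame
via_list = df[["Population"]]          # double brackets: a DataFrame directly

print(type(as_series).__name__, type(via_to_frame).__name__, type(via_list).__name__)
```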

- We can also select several columns at once.

In [54]:
df[[ "Population", "GDP" ]]

| | Population | GDP |
|---|---|---|
| Canada | 35.467 | 1785387 |
| France | 63.951 | 2833687 |
| Germany | 80.94 | 3874437 |
| Italy | 60.665 | 2167744 |
| Japan | 127.061 | 4602367 |
| United Kingdom | 64.511 | 2950039 |
| United States | 318.523 | 17348075 |


In [55]:
df[1:3]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |


- When we want specific rows, as above, it is clearer to use the .loc and .iloc indexers.

In [58]:
df.loc["France": "Italy"]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |


- To get only a specific column, we can also add a column name.

In [59]:
df.loc["France" : "Italy", "Population"]

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64

- Or, to get several specific columns:

In [63]:
df.loc["France" : "Italy", ["Population" , "GDP"] ]

| | Population | GDP |
|---|---|---|
| France | 63.951 | 2833687 |
| Germany | 80.94 | 3874437 |
| Italy | 60.665 | 2167744 |


- .iloc works the same way, using integer positions.

In [64]:
df.iloc[ [0, 1, -1] ]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [65]:
df.iloc[1:3]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |


In [66]:
df.iloc[1:3, 3]

France     0.888
Germany    0.916
Name: HDI, dtype: float64

In [67]:
df.iloc[1:3, [0,3]]

| | Population | HDI |
|---|---|---|
| France | 63.951 | 0.888 |
| Germany | 80.94 | 0.916 |
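The same cell can be reached by label with .loc or by position with .iloc. The sketch below, on a three-row example, uses Index.get_loc to translate a label into its position; row_pos and col_pos are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [35.467, 63.951, 80.940], "HDI": [0.913, 0.888, 0.916]},
    index=["Canada", "France", "Germany"],
)

# Label-based access: row label and column name.
by_label = df.loc["France", "HDI"]

# Position-based access: look up the integer positions of the same labels.
row_pos = df.index.get_loc("France")
col_pos = df.columns.get_loc("HDI")
by_position = df.iloc[row_pos, col_pos]

print(row_pos, col_pos, by_label, by_position)
```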


### Conditional Selection (Boolean Arrays)

- All the conditional selections we saw for pandas.Series also apply to DataFrames.

df

In [70]:
df.Population > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool

In [71]:
df[df.Population > 70]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [75]:
df.loc[df.Population > 70]

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


- Similarly, to look at just one column:

In [74]:
df.loc[df.Population > 70, "Population"]

Germany           80.940
Japan            127.061
United States    318.523
Name: Population, dtype: float64

In [77]:
df.loc[df.Population > 70, ["Population", "GDP"] ]

| | Population | GDP |
|---|---|---|
| Germany | 80.94 | 3874437 |
| Japan | 127.061 | 4602367 |
| United States | 318.523 | 17348075 |
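Conditions on different columns can be combined in one mask. A short sketch, using three of the columns from above, that combines two conditions with & and also filters with isin:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe",
                      "Asia", "Europe", "America"],
    },
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
)

# Both conditions must hold; each one needs its own parentheses.
big_european = df.loc[(df["Population"] > 70) & (df["Continent"] == "Europe"),
                      ["Population", "HDI"]]

# isin() keeps the rows whose value is in the given list.
non_european = df[df["Continent"].isin(["Asia", "America"])]

print(big_european)
print(non_european.index.tolist())
```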


### Removing Data
- In some cases we also need to delete data from a DataFrame. Note that drop returns a new DataFrame and leaves df unchanged.

In [79]:
df.drop("Canada")

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [80]:
df.drop(["Canada", "Japan"])

| | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


In [81]:
df.drop(columns=["Population", "GDP"])

| | Surface Area | HDI | Continent |
|---|---|---|---|
| Canada | 9984670 | 0.913 | America |
| France | 640679 | 0.888 | Europe |
| Germany | 357114 | 0.916 | Europe |
| Italy | 301336 | 0.873 | Europe |
| Japan | 377930 | 0.891 | Asia |
| United Kingdom | 242495 | 0.907 | Europe |
| United States | 9525067 | 0.915 | America |
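Because drop returns a new object, the result has to be stored in a variable to keep it. A minimal sketch on a three-row example (original, without_canada, and without_gdp are illustrative names):

```python
import pandas as pd

original = pd.DataFrame(
    {"Population": [35.467, 63.951, 80.940], "GDP": [1785387, 2833687, 3874437]},
    index=["Canada", "France", "Germany"],
)

# Drop a row by its index label; the result is a new DataFrame.
without_canada = original.drop("Canada")

# Drop a column by name using the columns keyword.
without_gdp = original.drop(columns=["GDP"])

# The original still has every row and column.
print(original.shape, without_canada.shape, without_gdp.shape)
```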
