# Pandas - Python Data Analysis Library

Python has several built-in collection types: lists, dictionaries, and tuples. Working with them is not as fast as the equivalent operations in C-based languages or in Fortran. This is why libraries emerged that deliver performance close to code written in C or Fortran while keeping the simplicity of writing Python.

Pandas is one of them. It is a Python library for analyzing and manipulating data structured as tables. Pandas is built on top of the NumPy library. NumPy also works with table-like data (arrays), but an array is meant to hold values of a single type that can be added and multiplied (most often numbers). Pandas works with tabular data as well but, unlike NumPy, places no restriction on the type of data stored in a table: each column can have its own type.
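
The difference between the two libraries can be sketched on a tiny example. The names `arr` and `mixed` and the sample values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype for all of its elements.
arr = np.array([[1, 2], [3, 4]])
print(arr.dtype)

# A DataFrame keeps a separate dtype per column, so text and numbers can coexist.
mixed = pd.DataFrame({"name": ["Trek", "Kona"], "price": [7699.0, 4699.0]})
print(mixed.dtypes)
```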

Pandas is a very useful library. These are just some of the situations in which it is used:
- handling missing data in numeric tables (missing entries are marked as *NaN - Not a Number*)
- resizing an object by adding or removing columns of a *DataFrame*
- flexible yet efficient grouping of data, which enables operations such as splitting, applying a function, and combining the results
- simple conversion of Python and NumPy data structures into a *DataFrame*, regardless of how the source objects are indexed
- selecting parts of the data into separate subsets based on column names
- reshaping and pivoting tables
- importing and exporting data from CSV files, Excel files, and databases.

... and many more.
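
Two of the items above, missing data and grouping, can be sketched as follows. The frame `items` and its values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data: one price is missing (None becomes NaN in a float column).
items = pd.DataFrame({
    "category": ["bike", "bike", "skis", "skis"],
    "price": [7699.0, None, 1499.0, 999.0],
})

# Count the missing values in each column.
missing_per_column = items.isna().sum()

# Group by category and average the prices; NaN values are skipped by default.
mean_by_category = items.groupby("category")["price"].mean()

print(missing_per_column)
print(mean_by_category)
```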

In [4]:
# pip install pandas

import pandas as pd

# Pandas data structures

Pandas is used for working with tabular data. It can load an entire table into a Python object, and we can then perform all kinds of manipulations on the data inside that object.

## Dataframe

Tabular storage is common in databases and spreadsheets (Excel, CSV). In Pandas this tabular data type is called a ***DataFrame***; it represents a table with multiple rows and columns. Because a *DataFrame* has rows and columns, we say it has two dimensions (axis 0 runs along the rows, axis 1 along the columns).

<img src="slike/01_table_dataframe.svg" alt="pandas dataframe"/>
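
A minimal sketch of what the two axes mean in practice, using a made-up frame called `grid`: reducing along axis 0 collapses the rows, reducing along axis 1 collapses the columns.

```python
import pandas as pd

grid = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

print(grid.shape)  # (number of rows, number of columns)

col_totals = grid.sum(axis=0)  # collapse the rows: one value per column
row_totals = grid.sum(axis=1)  # collapse the columns: one value per row
```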

In [2]:
data = {
    "Stupac 1": ["S1_Podatak 1", "S1_Podatak 2", "S1_Podatak 3", "S1_Podatak 4", "S1_Podatak 5"],
    "Stupac 2": ["S2_Podatak 1", "S2_Podatak 2", "S2_Podatak 3", "S2_Podatak 4", "S2_Podatak 5"],
    "Stupac 3": ["S3_Podatak 1", "S3_Podatak 2", "S3_Podatak 3", "S3_Podatak 4", "S3_Podatak 5"],
    "Stupac 4": ["S4_Podatak 1", "S4_Podatak 2", "S4_Podatak 3", "S4_Podatak 4", "S4_Podatak 5"],
}

In [3]:
df = pd.DataFrame(data)

df

Unnamed: 0,Stupac 1,Stupac 2,Stupac 3,Stupac 4
0,S1_Podatak 1,S2_Podatak 1,S3_Podatak 1,S4_Podatak 1
1,S1_Podatak 2,S2_Podatak 2,S3_Podatak 2,S4_Podatak 2
2,S1_Podatak 3,S2_Podatak 3,S3_Podatak 3,S4_Podatak 3
3,S1_Podatak 4,S2_Podatak 4,S3_Podatak 4,S4_Podatak 4
4,S1_Podatak 5,S2_Podatak 5,S3_Podatak 5,S4_Podatak 5


## Pandas `Series`

The second data structure Pandas works with is the *Series*. A Series represents a single column of a *DataFrame* table and therefore has only one dimension (axis 0).

<img src="slike/01_table_series.svg" alt="pandas dataframe"/>

In [4]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


### Accessing data in a `DataFrame`

In [5]:
df

Unnamed: 0,Stupac 1,Stupac 2,Stupac 3,Stupac 4
0,S1_Podatak 1,S2_Podatak 1,S3_Podatak 1,S4_Podatak 1
1,S1_Podatak 2,S2_Podatak 2,S3_Podatak 2,S4_Podatak 2
2,S1_Podatak 3,S2_Podatak 3,S3_Podatak 3,S4_Podatak 3
3,S1_Podatak 4,S2_Podatak 4,S3_Podatak 4,S4_Podatak 4
4,S1_Podatak 5,S2_Podatak 5,S3_Podatak 5,S4_Podatak 5


In [10]:
stupac = df["Stupac 1"]
stupac
# print(type(df["Stupac 1"]))

0    S1_Podatak 1
1    S1_Podatak 2
2    S1_Podatak 3
3    S1_Podatak 4
4    S1_Podatak 5
Name: Stupac 1, dtype: object

In [11]:
stupac.name

'Stupac 1'

In [16]:
serija = pd.Series([5, 8, 2, 6, 5, 4, 8], name="Testna serija")
serija

0    5
1    8
2    2
3    6
4    5
5    4
6    8
Name: Testna serija, dtype: int64

In [20]:
serija_indeksi = pd.Series([5, 8, 2, 6], name="Indeks serija", index=["prvi", "drugi", "treci", "cetvrti"])
serija_indeksi

prvi       5
drugi      8
treci      2
cetvrti    6
Name: Indeks serija, dtype: int64

In [21]:
serija_indeksi["prvi"]

5

In [23]:
serija_indeksi.iloc[1]  # positional access; plain [1] on a label-indexed Series is deprecated in recent pandas

8
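
Label-based and position-based access can be kept apart explicitly with `.loc` and `.iloc`. This is a small sketch on a Series shaped like the one above; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([5, 8, 2, 6], index=["prvi", "drugi", "treci", "cetvrti"])

by_label = s.loc["drugi"]              # lookup by index label
by_position = s.iloc[1]                # lookup by integer position (0-based)
label_slice = s.loc["drugi":"treci"]   # label slices include the end label
position_slice = s.iloc[1:3]           # position slices exclude the end position
```

Both slices select the same two elements here, but only because the end label is included while the end position is not.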

In [30]:
serija_indeksi[::2]

prvi     5
treci    2
Name: Indeks serija, dtype: int64

In [31]:
serija_indeksi

prvi       5
drugi      8
treci      2
cetvrti    6
Name: Indeks serija, dtype: int64

In [34]:
serija_umnozak = serija_indeksi * 2

In [35]:
serija_umnozak

prvi       10
drugi      16
treci       4
cetvrti    12
Name: Indeks serija, dtype: int64

In [36]:
serija_indeksi

prvi       5
drugi      8
treci      2
cetvrti    6
Name: Indeks serija, dtype: int64

In [37]:
serija_indeksi["prvi"] = 7

In [38]:
serija_indeksi

prvi       7
drugi      8
treci      2
cetvrti    6
Name: Indeks serija, dtype: int64

In [39]:
serija_veci_od_6 = serija_indeksi[serija_indeksi > 6]

In [41]:
serija_indeksi > 6

prvi        True
drugi       True
treci      False
cetvrti    False
Name: Indeks serija, dtype: bool

In [40]:
serija_veci_od_6

prvi     7
drugi    8
Name: Indeks serija, dtype: int64
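
Boolean masks like the one above can also be combined. The sketch below assumes a Series with the same values; note that `&`, `|` and `~` are used instead of `and`, `or` and `not`, and that each condition needs its own parentheses.

```python
import pandas as pd

s = pd.Series([7, 8, 2, 6], index=["prvi", "drugi", "treci", "cetvrti"])

between = s[(s > 5) & (s < 8)]   # both conditions must hold
extremes = s[(s < 3) | (s > 7)]  # either condition may hold
not_large = s[~(s > 6)]          # negation of a mask
```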

In [43]:
serija_duza = pd.Series([5, 6, 1, 7, 8, 2, 3, 1, 6, 7, 1, 5, 1, 6], name="Duza serija")
serija_duza

0     5
1     6
2     1
3     7
4     8
5     2
6     3
7     1
8     6
9     7
10    1
11    5
12    1
13    6
Name: Duza serija, dtype: int64

In [46]:
serija_duza.unique()  # unique elements (no repeats)

array([5, 6, 1, 7, 8, 2, 3], dtype=int64)

In [47]:
print(type(serija_duza.unique()))

<class 'numpy.ndarray'>


In [50]:
# count how many times each value occurs in the Series
serija_duza.value_counts().sort_values(ascending=False)

1    4
6    3
5    2
7    2
8    1
2    1
3    1
Name: Duza serija, dtype: int64
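
As a possible follow-up, `value_counts` can report relative frequencies instead of raw counts, and `nunique` gives the number of distinct values directly. A short sketch on the same numbers:

```python
import pandas as pd

s = pd.Series([5, 6, 1, 7, 8, 2, 3, 1, 6, 7, 1, 5, 1, 6])

shares = s.value_counts(normalize=True)  # fraction of the Series taken by each value
distinct = s.nunique()                   # number of different values

print(shares.loc[1])  # share of the value 1 (4 occurrences out of 14)
```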

In [51]:
serija_duza.isin([1, 5, 7])

0      True
1     False
2      True
3      True
4     False
5     False
6     False
7      True
8     False
9      True
10     True
11     True
12     True
13    False
Name: Duza serija, dtype: bool

In [52]:
serija_duza[serija_duza.isin([1, 5, 7])]

0     5
2     1
3     7
7     1
9     7
10    1
11    5
12    1
Name: Duza serija, dtype: int64

## Exercise

Generate 100 random numbers and store them in a Python list. Create a Pandas Series from that list and give it a name.

- Check how many numbers repeat and how many times each of them occurs.
- Which number is the largest and which is the smallest?
- Modify the data in the Series so that the values are scaled to the range from 0 to 1.
- Change the largest value so that it equals the smallest value. Check which number is now the largest.

In [59]:
import random as rd

# random.randint includes both endpoints, so this yields integers from 1 to 100
lista = [rd.randint(1, 100) for _ in range(100)]

random_serija = pd.Series(lista, name="Random serija")

In [57]:
random_serija

0     69
1     45
2     47
3     55
4     81
      ..
95    18
96    44
97    67
98    58
99    90
Name: Random serija, Length: 100, dtype: int64

In [60]:
random_serija.value_counts().sort_values(ascending=False)

57    5
96    4
40    3
36    3
3     3
     ..
46    1
21    1
42    1
33    1
53    1
Name: Random serija, Length: 63, dtype: int64

In [63]:
random_serija.min()  # smallest value in the Series

1

In [64]:
random_serija.max()  # largest value in the Series

99

In [65]:
random_serija = random_serija / 100

In [66]:
random_serija

0     0.67
1     0.40
2     0.52
3     0.74
4     0.70
      ... 
95    0.16
96    0.57
97    0.85
98    0.47
99    0.53
Name: Random serija, Length: 100, dtype: float64
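
The last item of the exercise (replacing the largest value with the smallest) could be approached as sketched below. The short Series `values` is a hypothetical stand-in for the 100 random numbers.

```python
import pandas as pd

# Hypothetical small Series standing in for the scaled random values.
values = pd.Series([0.67, 0.40, 0.99, 0.01, 0.99], name="Random serija")

smallest = values.min()
largest = values.max()

# Overwrite every occurrence of the largest value with the smallest one.
values.loc[values == largest] = smallest

print(values.max())  # the previous runner-up is now the largest value
```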

### Exercise
**Text to process:**

*Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.*

Process the text above into a list of words that are all lowercase and contain no punctuation. Create a Pandas Series from that list and give it a name.

Check how many words repeat and how many times each of them occurs.
Check which word is the longest and which is the shortest.

In [2]:
paragraf = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

paragraf = paragraf.lower().replace(".", "").replace(",", "")
print(paragraf)

lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor incididunt ut labore et dolore magna aliqua ut enim ad minim veniam quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur excepteur sint occaecat cupidatat non proident sunt in culpa qui officia deserunt mollit anim id est laborum


In [3]:
lista_rijeci = paragraf.split()  # .split(" ")
print(lista_rijeci)

['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', 'ut', 'enim', 'ad', 'minim', 'veniam', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', 'commodo', 'consequat', 'duis', 'aute', 'irure', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', 'nulla', 'pariatur', 'excepteur', 'sint', 'occaecat', 'cupidatat', 'non', 'proident', 'sunt', 'in', 'culpa', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', 'est', 'laborum']


In [5]:
pd_rijeci = pd.Series(lista_rijeci, name="Lorem ipsum")

In [6]:
pd_rijeci

0       lorem
1       ipsum
2       dolor
3         sit
4        amet
       ...   
64     mollit
65       anim
66         id
67        est
68    laborum
Name: Lorem ipsum, Length: 69, dtype: object

In [7]:
pd_rijeci.value_counts().sort_values(ascending=False)

in          3
ut          3
dolore      2
dolor       2
elit        1
           ..
officia     1
deserunt    1
mollit      1
anim        1
laborum     1
Name: Lorem ipsum, Length: 63, dtype: int64

In [10]:
broj_ponavljanja = pd_rijeci.value_counts()

In [15]:
broj_ponavljanja[broj_ponavljanja > 1]

in        3
ut        3
dolore    2
dolor     2
Name: Lorem ipsum, dtype: int64

In [23]:
pd_rijeci

0       lorem
1       ipsum
2       dolor
3         sit
4        amet
       ...   
64     mollit
65       anim
66         id
67        est
68    laborum
Name: Lorem ipsum, Length: 69, dtype: object

In [30]:
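pd_rijeci.min()  # alphabetically first word; min() compares strings lexicographically, not by length

'ad'

pd_rijeci.max()  # alphabetically last word, not necessarily the longest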
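
To compare words by length rather than alphabetically, one option is to go through `.str.len()`. The sketch below uses only the first sentence of the text to stay short; `words`, `lengths`, `shortest` and `longest` are illustrative names.

```python
import pandas as pd

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor."
words = pd.Series(text.lower().replace(".", "").replace(",", "").split(), name="Lorem ipsum")

lengths = words.str.len()                # number of characters in each word
shortest = words.loc[lengths.idxmin()]   # first word with the minimum length
longest = words.loc[lengths.idxmax()]    # first word with the maximum length

print(shortest, longest)
```

When several words share the minimum or maximum length, `idxmin` and `idxmax` return the first one; `words[lengths == lengths.min()]` would list all of them.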

## Dataframe

In [33]:
proizvodi_baza = {
    "Naziv": [
        "Trek", "Kona", "Giant",
        "Bianchi", "Cosmo Ride", "Neon",
        "Zeeclo", "Atomic", "Head",
        "Elan", "Salomon", "Rossignol"
    ],
    "Cijena": [
        7699.00, 4699.00, 5999.00,
        22499.00, 2599.00, 1999.00,
        2799.00, 1499.00, 1359.00,
        1499.00, 1699.00, 999.00
    ],
    "Kategorija": [
        "Bicikl", "Bicikl", "Bicikl",
        "Bicikl", "E-Romobil", "E-Romobil",
        "E-Romobil", "Skije", "Skije",
        "Skije", "Skije", "Skije"
    ],
    "Ocjena": [
        6.7, 6.5, 6.65,
        7.2, 8.64, 8.61,
        8.59, 7.99, 8.15,
        8.05, 7.91, 6.10
    ]
}

In [34]:
proizvodi = pd.DataFrame(proizvodi_baza)
proizvodi

Unnamed: 0,Naziv,Cijena,Kategorija,Ocjena
0,Trek,7699.0,Bicikl,6.7
1,Kona,4699.0,Bicikl,6.5
2,Giant,5999.0,Bicikl,6.65
3,Bianchi,22499.0,Bicikl,7.2
4,Cosmo Ride,2599.0,E-Romobil,8.64
5,Neon,1999.0,E-Romobil,8.61
6,Zeeclo,2799.0,E-Romobil,8.59
7,Atomic,1499.0,Skije,7.99
8,Head,1359.0,Skije,8.15
9,Elan,1499.0,Skije,8.05


In [35]:
proizvodi_dio = pd.DataFrame(proizvodi_baza, columns=["Naziv", "Cijena"])
proizvodi_dio

Unnamed: 0,Naziv,Cijena
0,Trek,7699.0
1,Kona,4699.0
2,Giant,5999.0
3,Bianchi,22499.0
4,Cosmo Ride,2599.0
5,Neon,1999.0
6,Zeeclo,2799.0
7,Atomic,1499.0
8,Head,1359.0
9,Elan,1499.0


We will rarely create a DataFrame this way; more often we get one by loading data from some source, such as a `.csv` file. In that case, one of the first steps before any analysis is to see what data we have:

In [41]:
proizvodi.columns

Index(['Naziv', 'Cijena', 'Kategorija', 'Ocjena'], dtype='object')

In [39]:
proizvodi.values

array([['Trek', 7699.0, 'Bicikl', 6.7],
       ['Kona', 4699.0, 'Bicikl', 6.5],
       ['Giant', 5999.0, 'Bicikl', 6.65],
       ['Bianchi', 22499.0, 'Bicikl', 7.2],
       ['Cosmo Ride', 2599.0, 'E-Romobil', 8.64],
       ['Neon', 1999.0, 'E-Romobil', 8.61],
       ['Zeeclo', 2799.0, 'E-Romobil', 8.59],
       ['Atomic', 1499.0, 'Skije', 7.99],
       ['Head', 1359.0, 'Skije', 8.15],
       ['Elan', 1499.0, 'Skije', 8.05],
       ['Salomon', 1699.0, 'Skije', 7.91],
       ['Rossignol', 999.0, 'Skije', 6.1]], dtype=object)

In [42]:
proizvodi.index

RangeIndex(start=0, stop=12, step=1)

In [43]:
proizvodi.index.name = "ID"

In [44]:
proizvodi

ID,Naziv,Cijena,Kategorija,Ocjena
0,Trek,7699.0,Bicikl,6.7
1,Kona,4699.0,Bicikl,6.5
2,Giant,5999.0,Bicikl,6.65
3,Bianchi,22499.0,Bicikl,7.2
4,Cosmo Ride,2599.0,E-Romobil,8.64
5,Neon,1999.0,E-Romobil,8.61
6,Zeeclo,2799.0,E-Romobil,8.59
7,Atomic,1499.0,Skije,7.99
8,Head,1359.0,Skije,8.15
9,Elan,1499.0,Skije,8.05


### Accessing elements of a `DataFrame`

In [47]:
# access a column

proizvodi["Naziv"]

ID
0           Trek
1           Kona
2          Giant
3        Bianchi
4     Cosmo Ride
5           Neon
6         Zeeclo
7         Atomic
8           Head
9           Elan
10       Salomon
11     Rossignol
Name: Naziv, dtype: object

In [48]:
# attempt to access a row
proizvodi[0]

KeyError: 0

Clearly, this is not how a row is accessed: `proizvodi[0]` looks for a column labelled `0`, and no such column exists.

In [54]:
proizvodi["Naziv"][11]

'Rossignol'

In [56]:
proizvodi["Naziv"][:5:-1]

ID
11    Rossignol
10      Salomon
9          Elan
8          Head
7        Atomic
6        Zeeclo
Name: Naziv, dtype: object

In [62]:
proizvodi[:1]

ID,Naziv,Cijena,Kategorija,Ocjena
0,Trek,7699.0,Bicikl,6.7


In [74]:
proizvodi.loc[0]  # fetch the row labelled 0 (here the first row); the result is a Series

Naziv           Trek
Cijena        7699.0
Kategorija    Bicikl
Ocjena           6.7
Name: 0, dtype: object

In [75]:
proizvodi.loc[[0, 2, 7]]  # fetch several arbitrary rows; the result is another DataFrame

ID,Naziv,Cijena,Kategorija,Ocjena
0,Trek,7699.0,Bicikl,6.7
2,Giant,5999.0,Bicikl,6.65
7,Atomic,1499.0,Skije,7.99
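
`.loc` selects by label, while `.iloc` selects by position; with the default 0, 1, 2, ... index the two happen to coincide. The sketch below uses a small hypothetical frame with non-default labels (10, 20, 30) so the difference is visible.

```python
import pandas as pd

products = pd.DataFrame(
    {"Naziv": ["Trek", "Kona", "Giant"], "Cijena": [7699.0, 4699.0, 5999.0]},
    index=[10, 20, 30],  # hypothetical labels, chosen to differ from positions
)

first_row = products.iloc[0]            # row at position 0
row_20 = products.loc[20]               # row whose label is 20
one_cell = products.loc[30, "Cijena"]   # row label and column name together
cheap = products.loc[products["Cijena"] < 6000, "Naziv"]  # boolean row filter plus a column
```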


In [76]:
# adding a new column to the DataFrame

In [78]:
proizvodi["Količina na skladištu"] = 15
proizvodi

ID,Naziv,Cijena,Kategorija,Ocjena,Količina na skladištu
0,Trek,7699.0,Bicikl,6.7,15
1,Kona,4699.0,Bicikl,6.5,15
2,Giant,5999.0,Bicikl,6.65,15
3,Bianchi,22499.0,Bicikl,7.2,15
4,Cosmo Ride,2599.0,E-Romobil,8.64,15
5,Neon,1999.0,E-Romobil,8.61,15
6,Zeeclo,2799.0,E-Romobil,8.59,15
7,Atomic,1499.0,Skije,7.99,15
8,Head,1359.0,Skije,8.15,15
9,Elan,1499.0,Skije,8.05,15


In [84]:
import numpy as np

proizvodi["Količina na skladištu"] = np.random.randint(100, size=12)
proizvodi

ID,Naziv,Cijena,Kategorija,Ocjena,Količina na skladištu
0,Trek,7699.0,Bicikl,6.7,28
1,Kona,4699.0,Bicikl,6.5,53
2,Giant,5999.0,Bicikl,6.65,21
3,Bianchi,22499.0,Bicikl,7.2,47
4,Cosmo Ride,2599.0,E-Romobil,8.64,45
5,Neon,1999.0,E-Romobil,8.61,99
6,Zeeclo,2799.0,E-Romobil,8.59,13
7,Atomic,1499.0,Skije,7.99,17
8,Head,1359.0,Skije,8.15,2
9,Elan,1499.0,Skije,8.05,37


In [85]:
# deleting a column from the DataFrame
del proizvodi["Količina na skladištu"]
proizvodi

ID,Naziv,Cijena,Kategorija,Ocjena
0,Trek,7699.0,Bicikl,6.7
1,Kona,4699.0,Bicikl,6.5
2,Giant,5999.0,Bicikl,6.65
3,Bianchi,22499.0,Bicikl,7.2
4,Cosmo Ride,2599.0,E-Romobil,8.64
5,Neon,1999.0,E-Romobil,8.61
6,Zeeclo,2799.0,E-Romobil,8.59
7,Atomic,1499.0,Skije,7.99
8,Head,1359.0,Skije,8.15
9,Elan,1499.0,Skije,8.05
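
`del` removes the column from the existing object. An alternative that leaves the original untouched is `drop(columns=...)`, which returns a new DataFrame. A minimal sketch on a made-up frame `stock`:

```python
import pandas as pd

stock = pd.DataFrame({
    "Naziv": ["Trek", "Kona"],
    "Cijena": [7699.0, 4699.0],
    "Kolicina": [3, 5],  # hypothetical column name used only in this sketch
})

# drop() returns a new DataFrame; `stock` itself still has all three columns.
without_qty = stock.drop(columns=["Kolicina"])
```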


# Reading and writing data

In [89]:
cars = pd.read_csv("podaci/cars.csv")
cars

Unnamed: 0,Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
0,STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;...
1,Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3...
2,Buick Skylark 320;15.0;8;350.0;165.0;3693.;11....
3,Plymouth Satellite;18.0;8;318.0;150.0;3436.;11...
4,AMC Rebel SST;16.0;8;304.0;150.0;3433.;12.0;70;US
...,...
402,Ford Mustang GL;27.0;4;140.0;86.00;2790.;15.6;...
403,Volkswagen Pickup;44.0;4;97.00;52.00;2130.;24....
404,Dodge Rampage;32.0;4;135.0;84.00;2295.;11.6;82;US
405,Ford Ranger;28.0;4;120.0;79.00;2625.;18.6;82;US


This is not a very useful format to draw conclusions from, because everything ends up in a single column. Why? Because the values in our `.csv` file are separated by `;`, while `pandas` expects `,` as the default separator.

This problem is easy to fix.

In [90]:
cars = pd.read_csv("podaci/cars.csv", sep=";")
cars

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
0,STRING,DOUBLE,INT,DOUBLE,DOUBLE,DOUBLE,DOUBLE,INT,CAT
1,Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504.,12.0,70,US
2,Buick Skylark 320,15.0,8,350.0,165.0,3693.,11.5,70,US
3,Plymouth Satellite,18.0,8,318.0,150.0,3436.,11.0,70,US
4,AMC Rebel SST,16.0,8,304.0,150.0,3433.,12.0,70,US
...,...,...,...,...,...,...,...,...,...
402,Ford Mustang GL,27.0,4,140.0,86.00,2790.,15.6,82,US
403,Volkswagen Pickup,44.0,4,97.00,52.00,2130.,24.6,82,Europe
404,Dodge Rampage,32.0,4,135.0,84.00,2295.,11.6,82,US
405,Ford Ranger,28.0,4,120.0,79.00,2625.,18.6,82,US
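
The effect of the separator can be reproduced without the file. The sketch below feeds a hypothetical two-row sample in the same semicolon-separated layout to `read_csv` through `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the file.
raw = "Car;MPG;Origin\nFord Torino;17.0;US\nHonda Civic;38.0;Japan\n"

one_column = pd.read_csv(io.StringIO(raw))       # default sep=",": everything lands in one column
split = pd.read_csv(io.StringIO(raw), sep=";")   # columns are split on ";"

print(one_column.shape, split.shape)
```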


In [91]:
cars.columns

Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
       'Acceleration', 'Model', 'Origin'],
      dtype='object')

In [93]:
cars.values

array([['STRING', 'DOUBLE', 'INT', ..., 'DOUBLE', 'INT', 'CAT'],
       ['Chevrolet Chevelle Malibu', '18.0', '8', ..., '12.0', '70',
        'US'],
       ['Buick Skylark 320', '15.0', '8', ..., '11.5', '70', 'US'],
       ...,
       ['Dodge Rampage', '32.0', '4', ..., '11.6', '82', 'US'],
       ['Ford Ranger', '28.0', '4', ..., '18.6', '82', 'US'],
       ['Chevy S-10', '31.0', '4', ..., '19.4', '82', 'US']], dtype=object)

In [94]:
cars.dtypes

Car             object
MPG             object
Cylinders       object
Displacement    object
Horsepower      object
Weight          object
Acceleration    object
Model           object
Origin          object
dtype: object

In [98]:
cars = cars.drop(0)  # drop the row that holds type labels (STRING, DOUBLE, ...)
cars["Horsepower"] = cars["Horsepower"].astype("float")
cars


Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
1,Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504.,12.0,70,US
2,Buick Skylark 320,15.0,8,350.0,165.0,3693.,11.5,70,US
3,Plymouth Satellite,18.0,8,318.0,150.0,3436.,11.0,70,US
4,AMC Rebel SST,16.0,8,304.0,150.0,3433.,12.0,70,US
5,Ford Torino,17.0,8,302.0,140.0,3449.,10.5,70,US
...,...,...,...,...,...,...,...,...,...
402,Ford Mustang GL,27.0,4,140.0,86.0,2790.,15.6,82,US
403,Volkswagen Pickup,44.0,4,97.00,52.0,2130.,24.6,82,Europe
404,Dodge Rampage,32.0,4,135.0,84.0,2295.,11.6,82,US
405,Ford Ranger,28.0,4,120.0,79.0,2625.,18.6,82,US


In [99]:
cars.dtypes

Car              object
MPG              object
Cylinders        object
Displacement     object
Horsepower      float64
Weight           object
Acceleration     object
Model            object
Origin           object
dtype: object

In [100]:
cars["MPG"] = cars["MPG"].astype("float")
cars["Cylinders"] = cars["Cylinders"].astype("int")
cars["Displacement"] = cars["Displacement"].astype("float")
cars["Weight"] = cars["Weight"].astype(float)
cars["Acceleration"] = cars["Acceleration"].astype("float")
cars["Model"] = cars["Model"].astype("int")

cars.dtypes

Car              object
MPG             float64
Cylinders         int32
Displacement    float64
Horsepower      float64
Weight          float64
Acceleration    float64
Model             int32
Origin           object
dtype: object

In [105]:
cars["Car"] = cars["Car"].astype("string")
cars.dtypes

Car              string
MPG             float64
Cylinders         int32
Displacement    float64
Horsepower      float64
Weight          float64
Acceleration    float64
Model             int32
Origin           object
dtype: object
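
The column-by-column conversions above can also be written more compactly. This sketch assumes a small hypothetical frame of text values: `astype` accepts a mapping from column name to type, and `pd.to_numeric` with `errors="coerce"` is one way to cope with entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical text-only frame, similar to what read_csv produced above.
raw = pd.DataFrame({
    "MPG": ["18.0", "15.0"],
    "Cylinders": ["8", "8"],
    "Horsepower": ["130.0", "n/a"],
})

# astype accepts a mapping of column name -> target type.
typed = raw.astype({"MPG": "float64", "Cylinders": "int64"})

# errors="coerce" turns unparseable text into NaN instead of raising an error.
typed["Horsepower"] = pd.to_numeric(typed["Horsepower"], errors="coerce")

print(typed.dtypes)
```

Writing `"int64"` explicitly avoids the platform-dependent width that plain `"int"` can produce (the output above shows `int32`).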

In [111]:
cars = pd.read_csv("podaci/cars.csv", sep=";", skiprows=[1])
cars

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504.0,12.0,70,US
1,Buick Skylark 320,15.0,8,350.0,165.0,3693.0,11.5,70,US
2,Plymouth Satellite,18.0,8,318.0,150.0,3436.0,11.0,70,US
3,AMC Rebel SST,16.0,8,304.0,150.0,3433.0,12.0,70,US
4,Ford Torino,17.0,8,302.0,140.0,3449.0,10.5,70,US
...,...,...,...,...,...,...,...,...,...
401,Ford Mustang GL,27.0,4,140.0,86.0,2790.0,15.6,82,US
402,Volkswagen Pickup,44.0,4,97.0,52.0,2130.0,24.6,82,Europe
403,Dodge Rampage,32.0,4,135.0,84.0,2295.0,11.6,82,US
404,Ford Ranger,28.0,4,120.0,79.0,2625.0,18.6,82,US


In [107]:
cars.dtypes

Car              object
MPG             float64
Cylinders         int64
Displacement    float64
Horsepower      float64
Weight          float64
Acceleration    float64
Model             int64
Origin           object
dtype: object
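
Why skipping one line fixes all the types can be seen on a hypothetical in-memory sample: as long as the line with type labels is read as data, every column contains text, so no column can be numeric.

```python
import io
import pandas as pd

# Hypothetical sample: the second line of the file holds type labels, not data.
raw = "Car;MPG;Cylinders\nSTRING;DOUBLE;INT\nFord Torino;17.0;8\nHonda Civic;38.0;4\n"

with_labels = pd.read_csv(io.StringIO(raw), sep=";")
# skiprows=[1] skips the file line with index 1 (the header is line 0).
without_labels = pd.read_csv(io.StringIO(raw), sep=";", skiprows=[1])

print(with_labels.dtypes)
print(without_labels.dtypes)
```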

In [114]:
cars.head(20)  # first 20 rows of the DataFrame

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504.0,12.0,70,US
1,Buick Skylark 320,15.0,8,350.0,165.0,3693.0,11.5,70,US
2,Plymouth Satellite,18.0,8,318.0,150.0,3436.0,11.0,70,US
3,AMC Rebel SST,16.0,8,304.0,150.0,3433.0,12.0,70,US
4,Ford Torino,17.0,8,302.0,140.0,3449.0,10.5,70,US
5,Ford Galaxie 500,15.0,8,429.0,198.0,4341.0,10.0,70,US
6,Chevrolet Impala,14.0,8,454.0,220.0,4354.0,9.0,70,US
7,Plymouth Fury iii,14.0,8,440.0,215.0,4312.0,8.5,70,US
8,Pontiac Catalina,14.0,8,455.0,225.0,4425.0,10.0,70,US
9,AMC Ambassador DPL,15.0,8,390.0,190.0,3850.0,8.5,70,US


In [115]:
cars.tail(20)   # last 20 rows of the DataFrame

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin
386,Plymouth Horizon Miser,38.0,4,105.0,63.0,2125.0,14.7,82,US
387,Mercury Lynx l,36.0,4,98.0,70.0,2125.0,17.3,82,US
388,Nissan Stanza XE,36.0,4,120.0,88.0,2160.0,14.5,82,Japan
389,Honda Accord,36.0,4,107.0,75.0,2205.0,14.5,82,Japan
390,Toyota Corolla,34.0,4,108.0,70.0,2245.0,16.9,82,Japan
391,Honda Civic,38.0,4,91.0,67.0,1965.0,15.0,82,Japan
392,Honda Civic (auto),32.0,4,91.0,67.0,1965.0,15.7,82,Japan
393,Datsun 310 GX,38.0,4,91.0,67.0,1995.0,16.2,82,Japan
394,Buick Century Limited,25.0,6,181.0,110.0,2945.0,16.4,82,US
395,Oldsmobile Cutlass Ciera (diesel),38.0,6,262.0,85.0,3015.0,17.0,82,US


In [117]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           406 non-null    object 
 1   MPG           406 non-null    float64
 2   Cylinders     406 non-null    int64  
 3   Displacement  406 non-null    float64
 4   Horsepower    406 non-null    float64
 5   Weight        406 non-null    float64
 6   Acceleration  406 non-null    float64
 7   Model         406 non-null    int64  
 8   Origin        406 non-null    object 
dtypes: float64(5), int64(2), object(2)
memory usage: 28.7+ KB


In [119]:
cars.empty   # True only if the DataFrame has no rows or no columns; it does not detect missing values in cells

False
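
To look for missing values in individual cells, `isna()` is the usual tool. A minimal sketch on a made-up frame, contrasted with `.empty`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

print(frame.empty)  # False: the frame has rows and columns

has_missing = frame.isna().any().any()    # True if at least one cell is missing
missing_per_column = frame.isna().sum()   # number of missing cells in each column
no_rows = pd.DataFrame({"a": []}).empty   # True: this frame has zero rows
```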

## Exercise

Load the data from the file `e-cars.csv` into a `DataFrame` and normalize it, i.e. clean it.

In [127]:
e_cars = pd.read_csv("podaci/e-cars.csv", sep=";")

e_cars

Unnamed: 0,YEAR,Make,Model,Size,(kW),Unnamed: 5,TYPE,CITY (kWh/100 km),HWY (kWh/100 km),COMB (kWh/100 km),CITY (Le/100 km),HWY (Le/100 km),COMB (Le/100 km),(g/km),RATING,(km),TIME (h)
0,2012,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
1,2012,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
2,2013,FORD,FOCUS ELECTRIC,COMPACT,107,A1,B,19.0,21.1,20.0,2.1,2.4,2.2,0,,122,4
3,2013,MITSUBISHI,i-MiEV,SUBCOMPACT,49,A1,B,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
4,2013,NISSAN,LEAF,MID-SIZE,80,A1,B,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
5,2013,SMART,FORTWO ELECTRIC DRIVE CABRIOLET,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
6,2013,SMART,FORTWO ELECTRIC DRIVE COUPE,TWO-SEATER,35,A1,B,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
7,2013,TESLA,MODEL S (40 kWh battery),FULL-SIZE,270,A1,B,22.4,21.9,22.2,2.5,2.5,2.5,0,,224,6
8,2013,TESLA,MODEL S (60 kWh battery),FULL-SIZE,270,A1,B,22.2,21.7,21.9,2.5,2.4,2.5,0,,335,10
9,2013,TESLA,MODEL S (85 kWh battery),FULL-SIZE,270,A1,B,23.8,23.2,23.6,2.7,2.6,2.6,0,,426,12


In [122]:
e_cars.columns

Index(['YEAR', 'Make', 'Model', 'Size', '(kW)', 'Unnamed: 5', 'TYPE',
       'CITY (kWh/100 km)', 'HWY (kWh/100 km)', 'COMB (kWh/100 km)',
       'CITY (Le/100 km)', 'HWY (Le/100 km)', 'COMB (Le/100 km)', '(g/km)',
       'RATING', '(km)', 'TIME (h)'],
      dtype='object')

In [128]:
e_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   YEAR               53 non-null     int64  
 1   Make               53 non-null     object 
 2   Model              53 non-null     object 
 3   Size               53 non-null     object 
 4   (kW)               53 non-null     int64  
 5   Unnamed: 5         53 non-null     object 
 6   TYPE               53 non-null     object 
 7   CITY (kWh/100 km)  53 non-null     float64
 8   HWY (kWh/100 km)   53 non-null     float64
 9   COMB (kWh/100 km)  53 non-null     float64
 10  CITY (Le/100 km)   53 non-null     float64
 11  HWY (Le/100 km)    53 non-null     float64
 12  COMB (Le/100 km)   53 non-null     float64
 13  (g/km)             53 non-null     int64  
 14  RATING             19 non-null     float64
 15  (km)               53 non-null     int64  
 16  TIME (h)           53 non-nu

In [129]:
e_cars["Unnamed: 5"].unique()

array(['A1'], dtype=object)

In [132]:
e_cars["Unnamed: 5"].value_counts()

A1    53
Name: Unnamed: 5, dtype: int64

In [130]:
e_cars["Make"].unique()

array(['MITSUBISHI', 'NISSAN', 'FORD', 'SMART', 'TESLA', 'CHEVROLET',
       'BMW', 'KIA'], dtype=object)

In [133]:
e_cars["TYPE"].unique()

array(['B'], dtype=object)

In [134]:
del e_cars["Unnamed: 5"]

In [135]:
del e_cars["TYPE"]
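
The two deleted columns hold a single value in every row (`A1` and `B`), so they carry no information. Such columns could also be found automatically with `nunique`; the sketch below uses a small hypothetical frame rather than the real file.

```python
import pandas as pd

# Hypothetical miniature with one constant column.
frame = pd.DataFrame({
    "Make": ["NISSAN", "TESLA", "BMW"],
    "TYPE": ["B", "B", "B"],
    "(kW)": [80, 270, 125],
})

unique_counts = frame.nunique()  # number of distinct values per column
constant_cols = unique_counts[unique_counts == 1].index.tolist()
cleaned = frame.drop(columns=constant_cols)

print(constant_cols)
```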

In [136]:
e_cars

Unnamed: 0,YEAR,Make,Model,Size,(kW),CITY (kWh/100 km),HWY (kWh/100 km),COMB (kWh/100 km),CITY (Le/100 km),HWY (Le/100 km),COMB (Le/100 km),(g/km),RATING,(km),TIME (h)
0,2012,MITSUBISHI,i-MiEV,SUBCOMPACT,49,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
1,2012,NISSAN,LEAF,MID-SIZE,80,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
2,2013,FORD,FOCUS ELECTRIC,COMPACT,107,19.0,21.1,20.0,2.1,2.4,2.2,0,,122,4
3,2013,MITSUBISHI,i-MiEV,SUBCOMPACT,49,16.9,21.4,18.7,1.9,2.4,2.1,0,,100,7
4,2013,NISSAN,LEAF,MID-SIZE,80,19.3,23.0,21.1,2.2,2.6,2.4,0,,117,7
5,2013,SMART,FORTWO ELECTRIC DRIVE CABRIOLET,TWO-SEATER,35,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
6,2013,SMART,FORTWO ELECTRIC DRIVE COUPE,TWO-SEATER,35,17.2,22.5,19.6,1.9,2.5,2.2,0,,109,8
7,2013,TESLA,MODEL S (40 kWh battery),FULL-SIZE,270,22.4,21.9,22.2,2.5,2.5,2.5,0,,224,6
8,2013,TESLA,MODEL S (60 kWh battery),FULL-SIZE,270,22.2,21.7,21.9,2.5,2.4,2.5,0,,335,10
9,2013,TESLA,MODEL S (85 kWh battery),FULL-SIZE,270,23.8,23.2,23.6,2.7,2.6,2.6,0,,426,12


In [137]:
e_cars.dtypes

YEAR                   int64
Make                  object
Model                 object
Size                  object
(kW)                   int64
CITY (kWh/100 km)    float64
HWY (kWh/100 km)     float64
COMB (kWh/100 km)    float64
CITY (Le/100 km)     float64
HWY (Le/100 km)      float64
COMB (Le/100 km)     float64
(g/km)                 int64
RATING               float64
(km)                   int64
TIME (h)               int64
dtype: object

In [138]:
e_cars["Size"].value_counts()

FULL-SIZE                21
SUBCOMPACT               10
TWO-SEATER                8
MID-SIZE                  6
COMPACT                   4
STATION WAGON - SMALL     2
SUV - STANDARD            2
Name: Size, dtype: int64

Since the data is in fairly good shape, there is not much cleaning left to do here, and we will leave the analysis for another time.

## Exercise

Load the data from the file `titanic_data.csv` into a `DataFrame` and normalize it, i.e. clean it.

In [139]:
titanic_data = pd.read_csv("podaci/titanic_data.csv")

titanic_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [140]:
titanic_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [141]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
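
The `info()` output shows that `Age`, `Cabin` and `Embarked` have fewer non-null entries than there are rows. One possible way to handle such gaps is sketched below on a hypothetical three-row miniature; whether to fill or drop missing values depends on the analysis that follows, so this is an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the columns that contain gaps.
people = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

missing = people.isna().sum()  # gaps per column

# Fill missing ages with the median age (one of several possible choices).
people["Age"] = people["Age"].fillna(people["Age"].median())

# Drop the rows in which no port was recorded.
people = people.dropna(subset=["Embarked"])
```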


In [144]:
titanic_data = titanic_data.rename(columns={"Pclass": "Passenger class"})

In [147]:
titanic_data.rename(
    columns={
        "Passenger class": "Passenger Class",
        "SibSp": "Siblings/Spouses Aboard",
        "Parch": "Parents/Children Aboard",
        "Fare": "Fare (£)",
        "Embarked": "Port of Entry"
    },
    inplace=True
)

In [149]:
titanic_data.columns

Index(['PassengerId', 'Survived', 'Passenger Class', 'Name', 'Sex', 'Age',
       'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Ticket',
       'Fare (£)', 'Cabin', 'Port of Entry'],
      dtype='object')

In [150]:
titanic_data_copy = titanic_data.copy()

In [153]:
titanic_data_copy.columns = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"]

In [154]:
titanic_data_copy

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [155]:
titanic_data

Unnamed: 0,PassengerId,Survived,Passenger Class,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Ticket,Fare (£),Cabin,Port of Entry
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
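
The two outputs above show that renaming the columns of `titanic_data_copy` did not affect `titanic_data`, because `copy()` creates an independent object. A minimal sketch of the same behaviour, using hypothetical frames named `original` and `duplicate`:

```python
import pandas as pd

# Hypothetical frame standing in for titanic_data
original = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# copy() returns an independent DataFrame
duplicate = original.copy()

# Assigning to .columns replaces all labels at once, in order
duplicate.columns = ["A", "B"]

print(list(original.columns))
print(list(duplicate.columns))

# The new list must have exactly one label per column
try:
    duplicate.columns = ["A"]
except ValueError as err:
    print("ValueError:", err)
```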


In [156]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PassengerId              891 non-null    int64  
 1   Survived                 891 non-null    int64  
 2   Passenger Class          891 non-null    int64  
 3   Name                     891 non-null    object 
 4   Sex                      891 non-null    object 
 5   Age                      714 non-null    float64
 6   Siblings/Spouses Aboard  891 non-null    int64  
 7   Parents/Children Aboard  891 non-null    int64  
 8   Ticket                   891 non-null    object 
 9   Fare (£)                 891 non-null    float64
 10  Cabin                    204 non-null    object 
 11  Port of Entry            889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [157]:
titanic_data["Port of Entry"].unique()

array(['S', 'C', 'Q', nan], dtype=object)
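
`unique()` lists the distinct values but not how often each one occurs. As a sketch, `value_counts(dropna=False)` returns the frequencies and also counts the missing entries. The `ports` Series below is hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of port codes, including one missing value
ports = pd.Series(["S", "C", "S", "Q", np.nan, "S"], name="Port of Entry")

# dropna=False keeps NaN as its own category in the result
counts = ports.value_counts(dropna=False)
print(counts)
```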

Since the single-letter codes in `Port of Entry` are not very informative, let's replace them as follows:
- if the value is `S`, change it to `Southampton`
- if the value is `C`, change it to `Cherbourg`
- if the value is `Q`, change it to `Queenstown`

Leave empty values as they are.

In [167]:
titanic_data.loc[titanic_data["Port of Entry"] == "S", "Port of Entry"] = "Southampton"

In [168]:
titanic_data

Unnamed: 0,PassengerId,Survived,Passenger Class,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Ticket,Fare (£),Cabin,Port of Entry
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,Southampton
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,Southampton
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,Southampton
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,Southampton
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,Southampton
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,Southampton
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,Southampton
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
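
The `.loc` assignment above handles only the `S` code. The other two codes could be handled with two more assignments of the same form. An alternative sketch maps all three codes at once with `Series.replace` and a dictionary; the `port_names` dictionary and the small `df` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with all three codes and one missing value
df = pd.DataFrame({"Port of Entry": ["S", "C", "Q", np.nan, "S"]})

# Mapping from code to full port name, as listed in the task above
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# replace() swaps only the values found in the dictionary;
# anything else, including NaN, is left as it is
df["Port of Entry"] = df["Port of Entry"].replace(port_names)
print(df)
```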


In [165]:
titanic_data_copy

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K,L
0,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
1,bla,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
3,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
4,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
...,...,...,...,...,...,...,...,...,...,...,...,...
886,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
887,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
888,bla,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton,Southampton
889,bla,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
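
The `titanic_data_copy` output above, in which entire rows read `Southampton`, is consistent with a `.loc` assignment that left out the column label. The cell that produced it is not shown, so this is an assumption. The sketch below reproduces the pitfall on a small, hypothetical frame and contrasts it with the intended form.

```python
import pandas as pd

# Hypothetical frame used only to demonstrate the pitfall
df = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Port": ["S", "C", "S"],
})
mask = df["Port"] == "S"

# Pitfall: with no column label, every column of the matching rows is overwritten
wrong = df.copy()
wrong.loc[mask] = "Southampton"
print(wrong)

# Intended: the second argument restricts the assignment to a single column
right = df.copy()
right.loc[mask, "Port"] = "Southampton"
print(right)
```

Working on a copy, as the notebook does with `titanic_data_copy`, keeps such experiments from damaging the original data.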
