# Pandas and basic tabular data manipulation

Before attempting any more complex analysis, we need to learn the basics of working with the data being processed, which in our case means tables.

## Import libraries

In ordinary programming, we try to avoid aliases because they make the code harder for other programmers to read. Data analysis is different: a single, very widely used alias saves us a lot of typing. In addition, our analyses will result in tables and graphs rather than code.

In [1]:
import pandas as pd

## Load the data table

For reading data, Pandas has a number of `read_*` functions, which let it handle many different formats. The most common format is still CSV, in which the individual values are separated by commas.
If you are experimenting with this notebook, first download the data file from [this link](static/pokemon.csv). The experimental data were generated from the [Comprehensive Pokedex on GitHub](https://github.com/veekun/pokedex).

In [2]:
data = pd.read_csv("static/pokemon.csv")

After loading the data, it makes sense to look at it right away to check that everything is in order. The first line with the column names may be missing from the CSV file, the values may be separated by a character other than a comma, and many other things can go wrong. Our CSV file has been carefully prepared, however, so when we display it in the notebook, we will see a complete table where everything is as it should be.

In [3]:
data

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed
0,1,bulbasaur,0.7,6.9,green,quadruped,False,Grass,Poison,45,49,49,45
1,2,ivysaur,1.0,13.0,green,quadruped,False,Grass,Poison,60,62,63,60
2,3,venusaur,2.0,100.0,green,quadruped,False,Grass,Poison,80,82,83,80
3,4,charmander,0.6,8.5,red,upright,False,Fire,,39,52,43,65
4,5,charmeleon,1.1,19.0,red,upright,False,Fire,,58,64,58,80
...,...,...,...,...,...,...,...,...,...,...,...,...,...
802,803,poipole,0.6,1.8,purple,upright,False,Poison,,67,73,67,73
803,804,naganadel,3.6,150.0,purple,wings,False,Poison,Dragon,73,73,73,121
804,805,stakataka,5.5,820.0,gray,quadruped,False,Rock,Steel,61,131,211,13
805,806,blacephalon,1.8,13.0,white,humanoid,False,Fire,Ghost,53,127,53,107
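
For files that are less tidy than this one, `read_csv` accepts parameters that describe the layout. The following is a minimal sketch, assuming a made-up file body with semicolons as separators and no header row; the text in `raw` and the chosen column names are illustrative and are not part of the course data.

```python
import io

import pandas as pd

# Hypothetical file content: semicolon-separated, with no header line.
raw = "1;bulbasaur;0.7\n2;ivysaur;1.0\n"

# sep sets the separator character, header=None says that the first line
# is already data, and names supplies the column labels ourselves.
sample = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,
    names=["id", "name", "height"],
)

print(sample)
```

`io.StringIO` only stands in for a real file here; with a file on disk, the path would be passed as the first argument instead.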


The basic display in the notebook shows us the first five and last five rows in the table, along with information about the total number of rows and columns. Our table is intentionally small enough to work with comfortably, which is usually not the case with real data. We will therefore look at how to select only the parts of the table that interest us.

## Columns and rows

Columns and rows each have a so-called index, which can be used to access individual parts of the table. In our case, each row has an automatically added unique numeric index (the unnamed column on the left), and the first row of the CSV file, with the column names, serves as the index for the columns.
Later we will learn how to change indexes and how to put more than one value into them.

### Columns

Columns can be selected in the same way as values from a dictionary: just write the column name in square brackets.

In [4]:
data['name']

0        bulbasaur
1          ivysaur
2         venusaur
3       charmander
4       charmeleon
          ...     
802        poipole
803      naganadel
804      stakataka
805    blacephalon
806        zeraora
Name: name, Length: 807, dtype: object

If we need several columns at once, we put a list of their names in square brackets after the variable holding the table.

In [5]:
data[['name', 'color']]

Unnamed: 0,name,color
0,bulbasaur,green
1,ivysaur,green
2,venusaur,green
3,charmander,red
4,charmeleon,red
...,...,...
802,poipole,purple
803,naganadel,purple
804,stakataka,gray
805,blacephalon,white


If the column name does not contain a space and does not conflict with the name of an existing method, it can also be accessed using dot notation, which is very useful especially when writing more complex conditions.

In [6]:
data.name

0        bulbasaur
1          ivysaur
2         venusaur
3       charmander
4       charmeleon
          ...     
802        poipole
803      naganadel
804      stakataka
805    blacephalon
806        zeraora
Name: name, Length: 807, dtype: object

This approach is not applicable to the `shape` column because it conflicts with an attribute that stores the number of rows and columns in the table.

In [7]:
data.shape

(807, 13)
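
The difference between the two notations can be tried on a small table. This is a minimal sketch, assuming a hand-made two-row table; the variable name `pokemon` and the other names are illustrative only.

```python
import pandas as pd

# A tiny stand-in table with the same kinds of column names as above.
pokemon = pd.DataFrame({
    "name": ["bulbasaur", "charmander"],
    "shape": ["quadruped", "upright"],
    "type 1": ["Grass", "Fire"],
})

one_column = pokemon["name"]               # a single name gives a Series
two_columns = pokemon[["name", "shape"]]   # a list of names gives a DataFrame

# Dot notation finds the DataFrame attribute first, not the column.
dims = pokemon.shape             # (rows, columns) of the table
shape_column = pokemon["shape"]  # square brackets always reach the column

# A column name containing a space is only reachable with square brackets.
primary_type = pokemon["type 1"]
```

Square brackets work for every column name, so they are the safer choice whenever a name might clash with an attribute or method.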

### Rows

The so-called `loc` indexer is used for rows; it can retrieve one or more rows for us.
For example, we can request a single row by its index:

In [8]:
data.loc[1]

id                 2
name         ivysaur
height             1
weight            13
color          green
shape      quadruped
is baby        False
type 1         Grass
type 2        Poison
hp                60
attack            62
defense           63
speed             60
Name: 1, dtype: object

Multiple rows by listing their indexes:

In [9]:
data.loc[[1, 2, 3]]

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed
1,2,ivysaur,1.0,13.0,green,quadruped,False,Grass,Poison,60,62,63,60
2,3,venusaur,2.0,100.0,green,quadruped,False,Grass,Poison,80,82,83,80
3,4,charmander,0.6,8.5,red,upright,False,Fire,,39,52,43,65


Or a run of consecutive rows using a range of indexes (unlike an ordinary Python slice, the end of the range is included):

In [10]:
data.loc[100:105]

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed
100,101,electrode,1.2,66.6,red,ball,False,Electric,,60,50,70,150
101,102,exeggcute,0.4,2.5,pink,heads,False,Grass,Psychic,60,40,80,40
102,103,exeggutor,2.0,120.0,yellow,legs,False,Grass,Psychic,95,95,85,55
103,104,cubone,0.4,6.5,brown,upright,False,Ground,,50,50,95,35
104,105,marowak,1.0,45.0,brown,upright,False,Ground,,60,80,110,45
105,106,hitmonlee,1.5,49.8,brown,humanoid,False,Fighting,,50,120,53,87
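
The output above contains six rows, 100 through 105, because `loc` slices by label and includes both ends. A minimal sketch of how this differs from position-based slicing with `iloc`, assuming a throwaway five-row table named `numbers`:

```python
import pandas as pd

# Default integer index 0..4, so labels and positions happen to coincide.
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_label = numbers.loc[1:3]      # labels 1, 2 and 3 (end included)
by_position = numbers.iloc[1:3]  # positions 1 and 2 (end excluded)

print(by_label)
print(by_position)
```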


### Specific values

The `loc` indexer is also suitable for obtaining a specific value; both indexes (for the row and for the column) are then used together, in the form of a tuple.

In [11]:
data.loc[101, 'name']

'exeggcute'
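
Writing the two labels separated by a comma inside the brackets is just Python's way of passing a tuple. A minimal sketch, assuming a two-row stand-in table with hand-picked row labels (the name `pokemon` is illustrative):

```python
import pandas as pd

# Two rows with explicit row labels 100 and 101.
pokemon = pd.DataFrame(
    {"name": ["electrode", "exeggcute"], "speed": [150, 40]},
    index=[100, 101],
)

label = (101, "name")                  # (row label, column label)
via_tuple = pokemon.loc[label]         # explicit tuple
via_commas = pokemon.loc[101, "name"]  # the usual shorthand for the same thing

print(via_tuple, via_commas)
```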

## Data filtering

Very often it is useful to get the part of the table in which the values meet certain criteria. This can be done by writing a condition in square brackets, which returns only the rows that satisfy it.
For example, suppose we are only interested in red Pokémon.

In [12]:
data[data['color'] == "red"]

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed
3,4,charmander,0.6,8.5,red,upright,False,Fire,,39,52,43,65
4,5,charmeleon,1.1,19.0,red,upright,False,Fire,,58,64,58,80
5,6,charizard,1.7,90.5,red,upright,False,Fire,Flying,78,84,78,100
44,45,vileplume,1.2,18.6,red,humanoid,False,Grass,Poison,75,80,85,50
45,46,paras,0.3,5.4,red,armor,False,Bug,Grass,35,70,55,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...
726,727,incineroar,1.8,83.0,red,upright,False,Fire,Dark,95,115,90,60
740,741,oricorio,0.6,3.4,red,wings,False,Fire,Flying,75,70,70,93
775,776,turtonator,2.0,212.0,red,upright,False,Fire,Dragon,60,78,135,36
786,787,tapu-bulu,1.9,45.5,red,arms,False,Grass,Fairy,70,130,115,75


Conditions can also be combined, but certain rules must be followed. First, each condition must be enclosed in parentheses. Second, the logical operators `and` and `or` are replaced by the characters `&` and `|`.
What about red Pokémon with an attack power greater than or equal to 130?

In [13]:
data[(data['attack'] >= 130) & (data['color'] == "red")]

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed
98,99,kingler,1.3,60.0,red,armor,False,Water,,55,130,115,75
135,136,flareon,0.9,25.0,red,quadruped,False,Fire,,65,130,60,65
211,212,scizor,1.8,118.0,red,bug-wings,False,Bug,Steel,70,130,100,65
249,250,ho-oh,3.8,199.0,red,wings,False,Fire,Flying,106,130,90,90
382,383,groudon,3.5,950.0,red,upright,False,Ground,,100,150,140,90
385,386,deoxys,1.7,60.8,red,humanoid,False,Psychic,,50,150,50,150
554,555,darmanitan,1.3,92.9,red,quadruped,False,Fire,,105,140,55,95
716,717,yveltal,5.8,203.0,red,wings,False,Dark,Flying,126,131,95,99
786,787,tapu-bulu,1.9,45.5,red,arms,False,Grass,Fairy,70,130,115,75
793,794,buzzwole,2.4,333.6,red,tentacles,False,Bug,Fighting,107,139,139,79
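
The parentheses are needed because `&` and `|` bind more tightly than comparisons such as `>=` and `==`. One way to sidestep the issue is to store each condition in a variable first. The following is a minimal sketch on a three-row stand-in table; the names `pokemon`, `is_red` and `is_strong` are illustrative.

```python
import pandas as pd

pokemon = pd.DataFrame({
    "name": ["kingler", "charmander", "absol"],
    "color": ["red", "red", "white"],
    "attack": [130, 52, 130],
})

# Each comparison produces one True/False value per row.
is_red = pokemon["color"] == "red"
is_strong = pokemon["attack"] >= 130

both = pokemon[is_red & is_strong]    # rows where both conditions hold
either = pokemon[is_red | is_strong]  # rows where at least one holds
not_red = pokemon[~is_red]            # ~ negates a condition

print(both)
```

The plain keywords `and` and `or` do not work here, because they try to reduce a whole column to a single truth value, which Pandas refuses to do and reports as a `ValueError`.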


## Column operations

A condition written in square brackets is not really an ordinary condition. It is a bulk operation across the entire column that returns `True` or `False` for each row; when used for filtering, this value decides whether the row is displayed.

In [14]:
data.speed >= 65

0      False
1      False
2       True
3       True
4       True
       ...  
802     True
803     True
804    False
805     True
806     True
Name: speed, Length: 807, dtype: bool
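
Because the result is itself a column of `True` and `False` values, it can be used for more than filtering. A minimal sketch, assuming a stand-alone column holding the first five speed values from the table (the variable names are illustrative):

```python
import pandas as pd

speed = pd.Series([45, 60, 80, 65, 80], name="speed")

mask = speed >= 65               # one True/False per row

# True counts as 1 and False as 0 in arithmetic.
fast_count = int(mask.sum())     # how many rows satisfy the condition
fast_share = float(mask.mean())  # what fraction of rows satisfy it

print(fast_count, fast_share)
```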

Other familiar operators can be used in the same way, applied either to a column and a single value or to several columns at once.

In [15]:
data["speed"] / 10

0       4.5
1       6.0
2       8.0
3       6.5
4       8.0
       ... 
802     7.3
803    12.1
804     1.3
805    10.7
806    14.3
Name: speed, Length: 807, dtype: float64

In [16]:
data.hp * data.defense

0       2205
1       3780
2       6640
3       1677
4       3364
       ...  
802     4489
803     5329
804    12871
805     2809
806     6600
Length: 807, dtype: int64
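
One detail worth knowing is that operations between two columns pair the values up by index label, not by position. Within a single table the two coincide, but for columns from different sources they may not. A minimal sketch, assuming small hand-made data (the names `stats`, `a` and `b` are illustrative):

```python
import pandas as pd

stats = pd.DataFrame({
    "hp": [45, 60, 80],
    "defense": [49, 63, 83],
    "speed": [45, 60, 80],
})

scaled = stats["speed"] / 10             # column and a single value
product = stats["hp"] * stats["defense"]  # two columns, row by row

# Two Series with the same labels in a different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
total = a + b  # values are matched by label: x with x, y with y, z with z

print(total)
```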

The result of such an operation can be very easily added back to the original table as a new column.

In [17]:
data["fast"] = data.speed >= 65

In [18]:
data

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed,fast
0,1,bulbasaur,0.7,6.9,green,quadruped,False,Grass,Poison,45,49,49,45,False
1,2,ivysaur,1.0,13.0,green,quadruped,False,Grass,Poison,60,62,63,60,False
2,3,venusaur,2.0,100.0,green,quadruped,False,Grass,Poison,80,82,83,80,True
3,4,charmander,0.6,8.5,red,upright,False,Fire,,39,52,43,65,True
4,5,charmeleon,1.1,19.0,red,upright,False,Fire,,58,64,58,80,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
802,803,poipole,0.6,1.8,purple,upright,False,Poison,,67,73,67,73,True
803,804,naganadel,3.6,150.0,purple,wings,False,Poison,Dragon,73,73,73,121,True
804,805,stakataka,5.5,820.0,gray,quadruped,False,Rock,Steel,61,131,211,13,False
805,806,blacephalon,1.8,13.0,white,humanoid,False,Fire,Ghost,53,127,53,107,True
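
Assigning to a new column name changes the table in place. Where the original should stay untouched, the `assign` method returns a new table with the extra column instead. A minimal sketch on a two-row stand-in table; the column name `speed_tens` is made up for illustration.

```python
import pandas as pd

pokemon = pd.DataFrame({
    "name": ["bulbasaur", "charmander"],
    "speed": [45, 65],
})

# In-place: the table itself gains a new column.
pokemon["fast"] = pokemon["speed"] >= 65

# Copy: the original stays as it was, the result has one more column.
with_tens = pokemon.assign(speed_tens=pokemon["speed"] / 10)

print(list(pokemon.columns))
print(list(with_tens.columns))
```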


## Sorting

The `sort_values` method sorts by values, and the `sort_index` method sorts by index. The `head` method returns the first few rows and, conversely, the `tail` method returns the last few.
So we can easily get the five fastest Pokémon by sorting in descending order of speed and keeping only the first five rows.

In [19]:
data.sort_values(by="speed", ascending=False).head()

Unnamed: 0,id,name,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed,fast
290,291,ninjask,0.8,12.0,yellow,bug-wings,False,Bug,Flying,61,90,45,160,True
794,795,pheromosa,1.8,25.0,white,humanoid,False,Bug,Fighting,71,137,37,151,True
385,386,deoxys,1.7,60.8,red,humanoid,False,Psychic,,50,150,50,150,True
100,101,electrode,1.2,66.6,red,ball,False,Electric,,60,50,70,150,True
616,617,accelgor,0.8,25.3,red,arms,False,Bug,,80,70,40,145,True
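
In the result above, two Pokémon share a speed of 150. When the order among such ties matters, `sort_values` also accepts a list of columns, each with its own direction. A minimal sketch on a four-row stand-in table (the name `pokemon` is illustrative):

```python
import pandas as pd

pokemon = pd.DataFrame({
    "name": ["electrode", "bulbasaur", "ninjask", "deoxys"],
    "speed": [150, 45, 160, 150],
})

# Fastest first; rows with equal speed are ordered alphabetically by name.
tie_broken = pokemon.sort_values(
    by=["speed", "name"],
    ascending=[False, True],
)

fastest = tie_broken.head(2)  # first two rows of the sorted table
slowest = tie_broken.tail(1)  # last row of the sorted table

print(tie_broken)
```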


## Index change

Before changing the index, it's a good idea to make sure that the column that will become the index contains only unique values.

In [20]:
data.shape

(807, 14)

In [21]:
data.name.nunique()

807
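
Comparing `nunique()` with the number of rows is one way to check this. As a sketch of an alternative, assuming a made-up column that does contain a repeated value, the `is_unique` attribute gives the answer directly and `duplicated()` shows which entries are the repeats.

```python
import pandas as pd

# Hypothetical column with one repeated value.
names = pd.Series(["abra", "absol", "abra"], name="name")

all_unique = names.is_unique        # False, because "abra" appears twice
distinct = names.nunique()          # number of different values
repeats = names[names.duplicated()] # second and later occurrences only

print(all_unique, distinct)
print(repeats)
```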

Because the number of unique values in the Pokémon name column is the same as the total number of rows, we can use the name as an index. This can be done either by setting the `index` attribute or by using the `set_index` method, which can do much more.

In [22]:
data.set_index("name")

name,id,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed,fast
bulbasaur,1,0.7,6.9,green,quadruped,False,Grass,Poison,45,49,49,45,False
ivysaur,2,1.0,13.0,green,quadruped,False,Grass,Poison,60,62,63,60,False
venusaur,3,2.0,100.0,green,quadruped,False,Grass,Poison,80,82,83,80,True
charmander,4,0.6,8.5,red,upright,False,Fire,,39,52,43,65,True
charmeleon,5,1.1,19.0,red,upright,False,Fire,,58,64,58,80,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
poipole,803,0.6,1.8,purple,upright,False,Poison,,67,73,67,73,True
naganadel,804,3.6,150.0,purple,wings,False,Poison,Dragon,73,73,73,121,True
stakataka,805,5.5,820.0,gray,quadruped,False,Rock,Steel,61,131,211,13,False
blacephalon,806,1.8,13.0,white,humanoid,False,Fire,Ghost,53,127,53,107,True


Many methods return a modified copy of the table rather than changing the table itself. Because in this case we do want to change the table, we store the result of setting the index back in the `data` variable.

In [23]:
data = data.set_index("name")
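
That `set_index` leaves the original table alone can be verified on a small example. A minimal sketch, assuming a two-row stand-in table (the names `pokemon` and `indexed` are illustrative):

```python
import pandas as pd

pokemon = pd.DataFrame({
    "name": ["pikachu", "zubat"],
    "speed": [90, 55],
})

indexed = pokemon.set_index("name")  # returns a new table

# The original still has its numeric index and the "name" column.
original_columns = list(pokemon.columns)
original_index = list(pokemon.index)

# In the new table, "name" has moved from the columns into the index.
new_columns = list(indexed.columns)
new_index = list(indexed.index)

print(original_columns, new_columns)
```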

Now we can try sorting by the new index and also retrieving specific rows with its help.

In [24]:
data.sort_index()

name,id,height,weight,color,shape,is baby,type 1,type 2,hp,attack,defense,speed,fast
abomasnow,460,2.2,135.5,white,upright,False,Grass,Ice,90,92,75,60,False
abra,63,0.9,19.5,brown,upright,False,Psychic,,25,20,15,90,True
absol,359,1.2,47.0,white,quadruped,False,Dark,,65,130,60,75,True
accelgor,617,0.8,25.3,red,arms,False,Bug,,80,70,40,145,True
aegislash,681,1.7,53.0,brown,blob,False,Steel,Ghost,60,50,150,60,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
zoroark,571,1.6,81.1,gray,upright,False,Dark,,60,105,60,105,True
zorua,570,0.7,12.5,gray,quadruped,False,Dark,,40,65,40,65,True
zubat,41,0.8,7.5,purple,wings,False,Poison,Flying,40,45,35,55,False
zweilous,634,1.4,50.0,blue,quadruped,False,Dark,Dragon,72,85,70,58,False


In [25]:
data.loc["pikachu"]

id                25
height           0.4
weight             6
color         yellow
shape      quadruped
is baby        False
type 1      Electric
type 2           NaN
hp                35
attack            55
defense           40
speed             90
fast            True
Name: pikachu, dtype: object
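
With names as the index, `loc` works with those labels in the same way it worked with numbers before, and `reset_index` turns the index back into an ordinary column if the numeric one is needed again. A minimal sketch on a two-row stand-in table (the names `indexed` and `restored` are illustrative):

```python
import pandas as pd

indexed = pd.DataFrame(
    {"speed": [90, 55]},
    index=pd.Index(["pikachu", "zubat"], name="name"),
)

pikachu_speed = indexed.loc["pikachu", "speed"]  # row label, column label
reverse_order = indexed.sort_index(ascending=False)

# Move "name" back into the columns and restore a 0..n-1 index.
restored = indexed.reset_index()

print(pikachu_speed)
print(restored)
```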

## Continued

These are far from all the data manipulation options that Pandas offers, but they are enough to get started. We will show more during the exploratory data analysis in the following chapters.

## Time to play

Before the next lesson, it is important to get comfortable with Pandas and to make the basic data operations second nature. In the following lessons, we will show more advanced methods for working with data that build on this foundation. After each lesson, it is also well worth trying out the newly acquired knowledge, and for that you will need some data.
There is a plethora of better and worse data sources on the Internet, so everyone is sure to find something. To begin with, we recommend choosing data that you understand, because then you can focus on solving more advanced tasks during the analysis and will not have to puzzle over the meaning of cryptically named columns.
For a start, you can look here, for example:
* [National Catalog of Open Data (NKOD)](https://data.gov.cz/datov%C3%A9-sady) contains many datasets from various institutions, ranging from the [number of paid disability pensions in the Czech Republic by groups of diagnoses](https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A9-sady%2FCSShZbzpcn%2F695492977%2Fe6e6bc9dc686d59dca05bca05b) to [paid invoices of the Ministry of Defense](https://data.gov.cz/datov%C3%A1-sada?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A9-sady%2Fhttps---data.army.cz-api-3-action-package_show-id-uhrazen%C3%A9-faktury-2018).
* The Czech Statistical Office (CSO) provides some of its data to the NKOD, and more is available directly from the office. You can focus on [election results](https://volby.cz/opendata/opendata.htm) or [numbers of marriages and divorces](https://www.czso.cz/csu/czso/databaze-demografickych-udaju-za-obce-cr) or other demographic data.
* [Kaggle](https://www.kaggle.com/datasets) offers a huge number of datasets, from [forest fires in Brazil](https://www.kaggle.com/gustavomodelli/forest-fires-in-brazil), through information on [US breweries](https://www.kaggle.com/brkurzawa/us-breweries), to the [wholesale electricity market in Russia](https://www.kaggle.com/irinachuchueva/russian-wholesale-electricity-market).

Try downloading and exploring a dataset. If you like it, keep it; you will be able to do more and more sophisticated things with it after completing further lessons.