<img height="180px" src="https://drive.google.com/uc?export=view&id=141XOz6N4nk8Ru1sAl7vOsAToCLrSFCAX" alt="SDA logo" align="left" hspace="30px" vspace="50px"/>

# Welcome to your next notebook with SDA!

During the classes we will mostly use [Google Colaboratory](https://colab.research.google.com/?hl=en), a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

However, for bigger projects, especially those involving Deep Learning and/or reading large datasets, it might be better to set up Jupyter Notebook or JupyterLab on your own computer. It is also worth noting that many useful extensions (see [nbextensions](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/index.html) and [jupyter-labextension](https://jupyterlab.readthedocs.io/en/stable/user/extensions.html)) are not available to Colab users.

<img src="https://drive.google.com/uc?export=view&id=1UO2urRciECzoKE_vHy4RMGfFbkOWOGlW" alt="SDA logo" align="left" width="100px" hspace="10px" vspace="10px"/>

# AI Engineer
## Module 2: Data Processing and AI/ML Models
### Data Processing with Pandas

After the **<font color='#ed7d31'>Data Processing with Pandas</font>** course you will dive deeper into Python and familiarize yourself with concepts especially useful for future Data Analysts, Data Scientists and AI Engineers:
* data cleaning,
* data manipulation,
* basic data analysis,
* Pandas data structures,
  * `pd.Series()`
  * `pd.DataFrame()`
* Pandas functions and methods.


# Operations on files

<br><br>


#### **<font color='#306998'>MOUNT </font><font color='#ffd33b'>GOOGLE DRIVE</font>**

Download the `sentences.txt` file from the following Kaggle page:<br>
https://www.kaggle.com/datasets/olgabelitskaya/toy-data-for-text-processing

Upload it to your Google Drive, mount it and load the data.

*Hint: you can copy the path to the file, or change the working directory and then use just the file name with the .txt extension.*

In [6]:
# from google.colab import drive
# drive.mount('/content/drive')


# pandas - Python Data Analysis Library

<img src="https://drive.google.com/uc?export=view&id=1-IymXq4uU75EOw2ZEDoKt28FS4AGcVDF" alt="Pandas logo" title="Pandas logo" align="right" width="200px" hspace="10px" vspace="0px"/>

<br><br>
Pandas is another library essential for data analysis using Python. It provides powerful data structures that make working with **<font color='#5b9bd5'>tabular data</font>** simple and intuitive.

<img src="https://drive.google.com/uc?export=view&id=1s2kYiPt5tJQcMb8CHwk91acjIpso4equ" alt="Pandas: tabular data" title="Pandas: tabular data" align="center" width="500px" hspace="30px" vspace="10px"/>

> Source: https://www.geeksforgeeks.org/python-pandas-dataframe/

**<font color='#5b9bd5'>Did you know?</font>** - `.csv` files are plain text files with comma-separated values, e.g. "James", "Bond", 007



## Pandas data structures - `pd.Series()`

**<font color='#5b9bd5'>Pandas Series</font>** is a one-dimensional data structure (similar to a one-dimensional NumPy array, i.e. a vector) that stores the data together with an index of labels.

Creating with `pd.Series(<content>)`:





In [7]:
import pandas as pd

pd.Series(range(123, 115, -1))

0    123
1    122
2    121
3    120
4    119
5    118
6    117
7    116
dtype: int64
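To make the relationship between the data and the index more concrete, here is a minimal sketch. The pupil names and scores are hypothetical example data, not part of the course datasets.

```python
import pandas as pd

# Hypothetical example data: three exam scores labelled by pupil name.
scores = pd.Series([10, 12, 22], index=["Ela", "Jan", "Ola"], name="points")

print(scores.index.tolist())  # the index labels
print(scores.to_numpy())      # the underlying values as a NumPy array
print(scores["Jan"])          # label-based access
print(scores.iloc[0])         # position-based access
```

If no `index` is passed, pandas falls back to consecutive integers starting from 0, as in the output above.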


## Pandas data structures - `pd.DataFrame()`

The second structure introduced in pandas is the **<font color='#5b9bd5'>Pandas DataFrame</font>**. It is a two-dimensional data structure in the form of a table with rows and columns. Typically, columns have names and rows have index labels.

Creating - `pd.DataFrame(<content>)`

In [8]:
pd.DataFrame([[5, 2], [7, 13]])

Unnamed: 0,0,1
0,5,2
1,7,13
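A DataFrame is often easier to read when the columns are named. The sketch below builds one from a dictionary; the column names and values are hypothetical example data.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name.
orders = pd.DataFrame(
    {"item": ["Izze", "Chips"], "quantity": [1, 2]},
    index=["a", "b"],  # optional custom row labels
)

print(orders.shape)              # (number of rows, number of columns)
print(orders.columns.tolist())   # column names
print(type(orders["quantity"]))  # a single column is a pd.Series
```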



## Reading data with Pandas

To read data into a `pd.DataFrame()`, it needs to be stored in a supported format such as `.csv` or `.tsv` files, databases, Excel files etc. Delimited text files such as `.csv` and `.tsv` can be loaded with the `pd.read_csv()` function:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Note the list of all possible arguments!

```python
pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)
```

* `filepath_or_buffer` - path, URL or file-like object to read from
* `sep` / `delimiter` - column separator (comma by default)
* `header` - row with column names (if available)
* `index_col` - to specify which column contains the row index
* `usecols` - if you need to load only selected columns
* `skiprows` - if the initial rows contain metadata that you do not need
* `nrows` - to load only a limited number of rows
* `chunksize` - if you want to create a data reader (iterator) that loads the data in chunks
* `names` - column names to use if the file has none or you want to replace the default ones
* and many others!
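The arguments listed above can be tried out without any file on disk. The following sketch reads from an in-memory string through `io.StringIO`; the file contents are hypothetical and only serve as an illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: one metadata line, then tab-separated data.
raw = "exported 2020-01-01\nid\tname\tprice\n1\tIzze\t3.39\n2\tChips\t2.15\n3\tWater\t1.50\n"

df = pd.read_csv(
    io.StringIO(raw),         # a path or URL could be passed here instead
    sep="\t",                 # tab-separated values
    skiprows=1,               # skip the metadata line
    usecols=["id", "price"],  # load only these columns
    index_col="id",           # use the "id" column as the row index
    nrows=2,                  # read only the first two data rows
)
print(df)

# With chunksize, read_csv returns an iterator of DataFrames instead of one DataFrame.
chunks = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1, chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Reading in chunks is mainly useful when a file is too large to fit in memory at once.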

#### **<font color='#306998'>TASK </font><font color='#ffd33b'>FOR YOU</font>**

Download the `chipotle.tsv` file from the following Kaggle page:
https://www.kaggle.com/datasets/navneethc/chipotle

Upload it to your Google Drive, mount it and load the data.

*Hint: you can copy the path to the file, or change the working directory and then use just the file name with the .tsv extension.*

In [9]:
# from google.colab import drive
# drive.mount('/content/drive')

In [10]:
chipotle = pd.read_csv("../dane/chipotle.tsv", sep="\t")
chipotle

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
...,...,...,...,...,...
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75



## Display data

After the data is loaded, the next step is to display it and you can do it in many ways:
* simply print out what will fit on the screen with `df` (here `chipotle`)
* `df.head(n)` - displays the first n data records (5 by default)
* `df.tail(n)` - displays the last n data records (5 by default)
* `df.sample(n)` - displays n randomly chosen data records (1 by default)
* `df["column"]` or `df.column` - displays a given column (as `pd.Series()`)
    * note that the so-called dot notation works only for column names that are valid Python identifiers (e.g. no whitespace characters)❗
* `df[["column"]]` - displays a given column (as `pd.DataFrame()`)
* `df[["column1", "column2"]]` - displays several columns
    * note that you cannot select multiple columns with the single-square-bracket notation❗
* `df.index` - displays indexes
* `df.columns` - displays column names
* `df.info()` - general information about the data set (incl. missing values and data types)
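The difference between single and double square brackets listed above is easy to check on a small frame. This is a sketch on hypothetical data; one column name deliberately contains a space.

```python
import pandas as pd

# Hypothetical data; note that "item name" contains a space.
df = pd.DataFrame({"item name": ["Izze", "Chips", "Water"], "quantity": [1, 2, 5]})

as_series = df["quantity"]   # single brackets -> pd.Series
as_frame = df[["quantity"]]  # double brackets -> pd.DataFrame
print(type(as_series).__name__, type(as_frame).__name__)

print(df.quantity.equals(df["quantity"]))  # dot notation works for this name
print(df["item name"].tolist())            # names with spaces need brackets
print(df.head(2))                          # first two rows
```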

In [11]:
chipotle.sample()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
4315,1721,1,Veggie Bowl,"[Roasted Chili Corn Salsa, [Fajita Vegetables,...",$8.75


In [12]:
chipotle.index

RangeIndex(start=0, stop=4622, step=1)

In [13]:
chipotle.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

In [14]:
chipotle.item_price

0        $2.39 
1        $3.39 
2        $3.39 
3        $2.39 
4       $16.98 
         ...   
4617    $11.75 
4618    $11.75 
4619    $11.25 
4620     $8.75 
4621     $8.75 
Name: item_price, Length: 4622, dtype: object

In [15]:
chipotle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


### Subset data rows/cols with `.loc[]/.iloc[]`

For the `pd.DataFrame()`, there are two pandas indexers for retrieving specific data:
* `.loc[[rows], [cols]]` - selects rows and columns by their labels (index values and column names)
* `.iloc[[row_numbers], [col_numbers]]` - selects them by their integer positions
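The two indexers are easiest to tell apart when the row labels differ from the row positions. Below is a minimal sketch on hypothetical data; it also shows that label slices include their end point while positional slices do not.

```python
import pandas as pd

df = pd.DataFrame(
    {"quantity": [1, 2, 3, 4], "price": [2.5, 3.0, 4.5, 1.0]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

by_label = df.loc[[10, 20], ["price"]]  # rows labelled 10 and 20
by_position = df.iloc[[0, 1], [1]]      # first two rows, second column
print(by_label.equals(by_position))

# Label slices include the end label; positional slices exclude the end position.
print(len(df.loc[10:30]))  # labels 10, 20 and 30
print(len(df.iloc[0:2]))   # positions 0 and 1
```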

In [16]:
chipotle['item_price']

0        $2.39 
1        $3.39 
2        $3.39 
3        $2.39 
4       $16.98 
         ...   
4617    $11.75 
4618    $11.75 
4619    $11.25 
4620     $8.75 
4621     $8.75 
Name: item_price, Length: 4622, dtype: object

In [17]:
chipotle[0:3]  # rows at positions 0, 1 and 2; position 3 is excluded (half-open interval)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39


In [18]:
chipotle.loc[:, 'item_price']

0        $2.39 
1        $3.39 
2        $3.39 
3        $2.39 
4       $16.98 
         ...   
4617    $11.75 
4618    $11.75 
4619    $11.25 
4620     $8.75 
4621     $8.75 
Name: item_price, Length: 4622, dtype: object

In [19]:
chipotle.loc[:5, 'item_price']

0     $2.39 
1     $3.39 
2     $3.39 
3     $2.39 
4    $16.98 
5    $10.98 
Name: item_price, dtype: object

In [20]:
chipotle.loc[:, ['quantity', 'item_price']]

Unnamed: 0,quantity,item_price
0,1,$2.39
1,1,$3.39
2,1,$3.39
3,1,$2.39
4,2,$16.98
...,...,...
4617,1,$11.75
4618,1,$11.75
4619,1,$11.25
4620,1,$8.75


In [21]:
chipotle.loc[13, ['quantity', 'item_price']]  # only one row

quantity            1
item_price    $11.25 
Name: 13, dtype: object

In [22]:
chipotle.loc[13, ['item_price']]  # one-element Series: the column list keeps the label

item_price    $11.25 
Name: 13, dtype: object

In [23]:
chipotle.loc[13, 'item_price']  # selected single cell

'$11.25 '

In [24]:
chipotle.at[13, 'item_price']  # selected single cell

'$11.25 '

Note that `.iloc[0]` and `.loc[0]` return the same row as long as the index consists of consecutive integers starting from 0.

In [25]:
chipotle.iloc[0]

order_id                                         1
quantity                                         1
item_name             Chips and Fresh Tomato Salsa
choice_description                             NaN
item_price                                  $2.39 
Name: 0, dtype: object

In [26]:
chipotle.loc[0]

order_id                                         1
quantity                                         1
item_name             Chips and Fresh Tomato Salsa
choice_description                             NaN
item_price                                  $2.39 
Name: 0, dtype: object

However, if the index starts from 1 (or the data is shuffled, or a few rows are dropped), the two can return different rows.

In [27]:
chipotle.index += 1

In [28]:
chipotle.iloc[1]

order_id                         1
quantity                         1
item_name                     Izze
choice_description    [Clementine]
item_price                  $3.39 
Name: 2, dtype: object

In [29]:
chipotle.loc[1]

order_id                                         1
quantity                                         1
item_name             Chips and Fresh Tomato Salsa
choice_description                             NaN
item_price                                  $2.39 
Name: 1, dtype: object

### Data filtering with logical conditions

Note that you can apply a logical expression to a whole column; it is evaluated for each value and returns a Boolean `pd.Series()`:

In [30]:
chipotle["quantity"] > 5

1       False
2       False
3       False
4       False
5       False
        ...  
4618    False
4619    False
4620    False
4621    False
4622    False
Name: quantity, Length: 4622, dtype: bool

What is more, you can put such an expression in square brackets to select only the rows satisfying the given condition:

In [31]:
chipotle[chipotle["quantity"] > 5]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
3599,1443,15,Chips and Fresh Tomato Salsa,,$44.25
3600,1443,7,Bottled Water,,$10.50
3888,1559,8,Side of Chips,,$13.52
4153,1660,10,Bottled Water,,$15.00


In [32]:
chipotle[chipotle["order_id"] <= 2]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
1,1,1,Chips and Fresh Tomato Salsa,,$2.39
2,1,1,Izze,[Clementine],$3.39
3,1,1,Nantucket Nectar,[Apple],$3.39
4,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
5,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


In [33]:
chipotle[(chipotle["order_id"] > 1405) & (chipotle["order_id"] <= 1410)]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
3502,1406,1,Steak Burrito,"[[Lettuce, Fajita Veggies, Rice]]",$8.69
3503,1406,1,Steak Salad,"[[Lettuce, Fajita Veggies]]",$8.69
3504,1407,1,Steak Crispy Tacos,[Fresh Tomato Salsa],$9.25
3505,1407,1,Chips and Fresh Tomato Salsa,,$2.95
3506,1408,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Black Beans, Rice...",$10.98
3507,1408,1,Chips and Fresh Tomato Salsa,,$2.39
3508,1409,2,Chicken Salad Bowl,"[Tomatillo Green Chili Salsa, [Black Beans, Ch...",$22.50
3509,1410,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Pinto Beans...",$21.96


In [34]:
chipotle[(chipotle["order_id"] > 1830) | (chipotle["quantity"] > 4)]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
2442,970,5,Bottled Water,,$7.50
3599,1443,15,Chips and Fresh Tomato Salsa,,$44.25
3600,1443,7,Bottled Water,,$10.50
3888,1559,8,Side of Chips,,$13.52
4153,1660,10,Bottled Water,,$15.00
4613,1831,1,Carnitas Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$9.25
4614,1831,1,Chips,,$2.15
4615,1831,1,Bottled Water,,$1.50
4616,1832,1,Chicken Soft Tacos,"[Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]]",$8.75
4617,1832,1,Chips and Guacamole,,$4.45


In [35]:
chipotle[chipotle["item_name"] == "Bottled Water"]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
35,17,1,Bottled Water,,$1.09
88,38,1,Bottled Water,,$1.09
319,138,1,Bottled Water,,$1.09
330,143,1,Bottled Water,,$1.50
377,163,1,Bottled Water,,$1.50
...,...,...,...,...,...
4569,1817,1,Bottled Water,,$1.50
4571,1817,1,Bottled Water,,$1.50
4583,1822,2,Bottled Water,,$3.00
4599,1826,1,Bottled Water,,$1.50
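The filters above combine conditions with `&` (and) and `|` (or); each condition must be wrapped in parentheses or stored in a variable first. The sketch below, on hypothetical data, also shows negation with `~` and a membership test with `.isin()`.

```python
import pandas as pd

df = pd.DataFrame(
    {"item": ["Water", "Chips", "Izze", "Water"], "quantity": [1, 6, 2, 7]}
)

is_water = df["item"] == "Water"
is_large = df["quantity"] > 5

print(df[is_water & is_large])                 # both conditions hold
print(df[is_water | is_large])                 # at least one condition holds
print(df[~is_water])                           # negation
print(df[df["item"].isin(["Chips", "Izze"])])  # membership test
```

Python's `and`, `or` and `not` keywords do not work element-wise on a `pd.Series()`, which is why the operator symbols are used instead.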



## Basic statistics

All of the following operations, when called on a DataFrame, return a value for each column. We can also select a single column and perform the operation on it:
```python
df.count()  # counting the number of items (not NaNs)
df.column.value_counts()  # counting the occurrences of each unique value
df.sum()  # sum of all items
df.min()  # minimal element
df.max()  # maximal element
df.mean()  # dataset’s mean
df.median()  # dataset’s median
```
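When a DataFrame mixes text and numeric columns, numeric aggregations such as `df.mean()` may raise an error in recent pandas versions unless they are restricted to numeric columns. The sketch below uses hypothetical data and the `numeric_only=True` argument.

```python
import pandas as pd

df = pd.DataFrame(
    {"item": ["Water", "Chips", "Water"], "quantity": [1, 2, 3], "price": [1.5, 2.0, None]}
)

print(df.count())                  # non-missing values per column
print(df.mean(numeric_only=True))  # skips the text column
print(df["item"].value_counts())   # occurrences of each distinct value
print(df["price"].sum())           # missing values are skipped by default
```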

In [36]:
chipotle['quantity'].value_counts()

quantity
1     4355
2      224
3       28
4       10
5        1
15       1
7        1
8        1
10       1
Name: count, dtype: int64

In [37]:
chipotle['quantity'].sum()

4972

In [38]:
chipotle['quantity'].min(), chipotle['quantity'].max()

(1, 15)

In [39]:
chipotle['quantity'].mean()

1.0757247944612722

**<font color='#5b9bd5'>Please see [TASK 1](#scrollTo=z5eXe9__yhZ6) & [TASK 2](#scrollTo=o5y-XYr4bcv1).</font>**


# TASKS

## **<font color='#306998'>TASK </font><font color='#ffd33b'>1</font>**

Create a DataFrame with the names of 10 pupils and the number of points each obtained in the exam. Then check the mean and median of the results.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

```python
df = pd.DataFrame(data)
```

In [40]:
import pandas as pd

In [41]:
data = {
    "Pupil": ["Sebastian", "Ela", "Celina", "Jolanta", "Izabela", "Dariusz", "Szymon", "Dawid", "Wiktoria", "Weronika"],
    "Points": [10, 12, 22, 20, 18, 9, 8, 21, 11, 14]
}
df = pd.DataFrame(data)

In [42]:
df

Unnamed: 0,Pupil,Points
0,Sebastian,10
1,Ela,12
2,Celina,22
3,Jolanta,20
4,Izabela,18
5,Dariusz,9
6,Szymon,8
7,Dawid,21
8,Wiktoria,11
9,Weronika,14


In [43]:
avg = df["Points"].mean()
avg

14.5

In [44]:
df["Points"].median()

13.0

## **<font color='#306998'>TASK </font><font color='#ffd33b'>2</font>**

* Based on the DataFrame created in the previous task, display the fourth row.
* Then save the pupils who scored above average to a separate DataFrame.

In [45]:
df.iloc[3]

Pupil     Jolanta
Points         20
Name: 3, dtype: object

In [46]:
above_avg = df[df["Points"] > avg]
above_avg

Unnamed: 0,Pupil,Points
2,Celina,22
3,Jolanta,20
4,Izabela,18
7,Dawid,21


## **<font color='#306998'>TASK </font><font color='#ffd33b'>3</font>**

Create two `pd.Series()` objects, each containing 100 values equal to 0 or 1. Use a random number generator from the NumPy library, e.g. `np.random.randint()`. Then display all the indexes at which the values in both `pd.Series()` match.

In [47]:
import numpy as np

In [48]:
first = pd.Series(np.random.randint(2, size=100))
second = pd.Series(np.random.randint(2, size=100))

In [49]:
second[first == second].index

Index([ 3,  4,  5,  9, 12, 13, 14, 15, 17, 19, 20, 21, 23, 27, 28, 32, 33, 34,
       36, 37, 39, 40, 41, 42, 44, 45, 46, 52, 53, 54, 56, 57, 59, 60, 61, 62,
       64, 65, 66, 67, 71, 73, 74, 79, 80, 81, 82, 83, 84, 85, 87, 88, 90, 91,
       94, 97, 98, 99],
      dtype='int64')
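Because the two series are drawn at random, the matching indexes shown above change on every run. A possible variant, sketched below, uses NumPy's `default_rng` with a fixed seed so that the result is repeatable; the seed value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice
first = pd.Series(rng.integers(0, 2, size=100))   # values 0 or 1 (upper bound excluded)
second = pd.Series(rng.integers(0, 2, size=100))

# Boolean mask of positions where both series hold the same value.
mask = (first == second).to_numpy()
matching = first.index[mask]
print(matching.tolist())
```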

## **<font color='#306998'>Information for TASKs </font><font color='#ffd33b'>4-10</font>**

The following data is available at:<br> https://www.kaggle.com/datasets/davidbnn92/weather-data-for-covid19-data-analysis/.

Try downloading it from there using the `Download` button or with the API command
```python
!kaggle datasets download -d davidbnn92/weather-data-for-covid19-data-analysis
```
(see https://www.kaggle.com/discussions/general/74235 for more details).

If the data is no longer available you can always download it from our Google Drive https://drive.google.com/drive/folders/1KXr6yUW7rE0LzUzuuMehEzam8D9zxGpv?usp=sharing.

### About Dataset

The dataset contains selected meteorological features, such as temperature or wind speed, and was imported from the `NOAA GSOD dataset`, which is continuously updated to include recent measurements.

> Among others, you can find the following columns here:
* `Id`
* `Country_Region`
* `Date`
* `temp`: Mean temperature for the day in degrees Fahrenheit to tenths.
* `max`: Maximum temperature reported during the day.
* `min`: Minimum temperature reported during the day.
* `stp`: Mean station pressure for the day in millibars to tenths.
* `slp`: Mean sea level pressure for the day.
* `dewp`: Mean dew point for the day in degrees Fahrenheit to tenths.
* `wdsp`: Mean wind speed for the day in knots to tenths.
* `prcp`: Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths; `.00` indicates no measurable precipitation (includes a trace).
* `fog`: Indicator (1 = yes, 0 = no/not reported) for the occurrence of fog during the day.
>
> Note that time of max/min temperatures varies by country and region, so this will sometimes not be the max for the calendar day.

## **<font color='#306998'>TASK </font><font color='#ffd33b'>4</font>**

Explore the file `training_data_with_weather_info_week_4.csv` to find weather data related to Bangladesh.
1. Assign the dataframe to the variable `df_bangladesh`. To load the data, copy the file link and use `pd.read_csv()`.
1. Load only the specified columns by using the `usecols` parameter (you can also use `names` to change the column names to clearer ones).
1. Display the first 15 records of the dataset.
1. Determine the total number of entries (aka observations, rows, data points) in the dataset.
1. List all column names.
1. Describe how the dataset is indexed.
1. How many different countries are in this dataset? How many rows correspond to each country?
1. What country does the last entry in the table refer to?

In [50]:
data = pd.read_csv("../dane/training_data_with_weather_info_week_4.csv", usecols=["Id", "Country_Region", "Date", "temp", "max", "min", "stp", "slp", "dewp", "wdsp", "prcp", "fog"])

In [51]:
data.head(15)

Unnamed: 0,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog
0,1,Afghanistan,2020-01-22,42.6,33.6,54.9,999.9,1024.3,27.4,9.4,0.0,0
1,2,Afghanistan,2020-01-23,42.0,32.7,55.9,999.9,1020.8,22.8,14.9,99.99,1
2,3,Afghanistan,2020-01-24,40.1,36.9,43.2,999.9,1018.6,34.5,10.4,0.17,1
3,4,Afghanistan,2020-01-25,46.0,37.9,56.3,999.9,1018.0,37.8,6.1,0.57,1
4,5,Afghanistan,2020-01-26,42.8,36.1,53.1,999.9,1014.8,33.2,10.8,0.0,1
5,6,Afghanistan,2020-01-27,43.0,36.5,50.7,999.9,1015.7,35.6,3.7,0.04,0
6,7,Afghanistan,2020-01-28,41.7,34.7,48.2,999.9,1016.9,34.7,2.4,0.0,0
7,8,Afghanistan,2020-01-29,15.2,13.3,16.9,774.1,1024.7,0.2,2.4,0.0,0
8,9,Afghanistan,2020-01-30,15.2,13.3,16.9,774.1,1024.7,0.2,2.4,0.0,0
9,10,Afghanistan,2020-01-31,5.6,4.8,7.7,774.4,1031.0,-0.6,1.9,0.0,1


In [52]:
data.shape[0]

24414

In [53]:
data.columns

Index(['Id', 'Country_Region', 'Date', 'temp', 'min', 'max', 'stp', 'slp',
       'dewp', 'wdsp', 'prcp', 'fog'],
      dtype='object')

In [54]:
data.index

RangeIndex(start=0, stop=24414, step=1)

In [55]:
data["Country_Region"].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burma', 'Burundi',
       'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Denmark',
       'Diamond Princess', 'Djibouti', 'Dominica', 'Dominican Republic',
       'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France',
       'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece',
       'Grenada', 'Guatemala', 'Guinea', 'Guine

In [56]:
len(data["Country_Region"].unique())

184

In [57]:
data["Country_Region"].value_counts()

Country_Region
US                4212
China             2574
Canada             936
France             858
United Kingdom     858
                  ... 
Ghana               78
Greece              78
Grenada             78
Guatemala           78
Zimbabwe            78
Name: count, Length: 184, dtype: int64

In [58]:
data["Country_Region"].iloc[-1]

'Zimbabwe'

In [59]:
data["Country_Region"].tail()

24409    Zimbabwe
24410    Zimbabwe
24411    Zimbabwe
24412    Zimbabwe
24413    Zimbabwe
Name: Country_Region, dtype: object

## **<font color='#306998'>TASK </font><font color='#ffd33b'>5</font>**

In the `df_bangladesh` dataset, change the index values for the rows so that they start at 1.

Then display:
* first 7 rows
* third row both using `.loc[]` and `.iloc[]` - compare the outputs

In [60]:
df_bangladesh = data[data["Country_Region"] == "Bangladesh"]

In [61]:
df_bangladesh.head(7)

Unnamed: 0,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog
1560,2281,Bangladesh,2020-01-22,66.2,57.2,75.2,999.9,,54.7,2.3,0.0,1
1561,2282,Bangladesh,2020-01-23,65.2,57.2,75.2,999.9,,54.7,2.1,0.0,1
1562,2283,Bangladesh,2020-01-24,64.6,55.4,77.0,999.9,,54.6,3.4,0.0,1
1563,2284,Bangladesh,2020-01-25,65.4,55.4,75.2,999.9,,53.3,2.9,0.0,0
1564,2285,Bangladesh,2020-01-26,64.8,55.4,75.2,999.9,,53.5,2.9,0.0,0
1565,2286,Bangladesh,2020-01-27,64.2,55.4,73.4,999.9,,56.8,3.0,0.0,1
1566,2287,Bangladesh,2020-01-28,66.1,57.2,75.2,999.9,,57.0,0.6,0.0,1


In [62]:
df_bangladesh.reset_index(inplace=True)

In [63]:
df_bangladesh.index += 1

In [64]:
df_bangladesh.head(7)

Unnamed: 0,index,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog
1,1560,2281,Bangladesh,2020-01-22,66.2,57.2,75.2,999.9,,54.7,2.3,0.0,1
2,1561,2282,Bangladesh,2020-01-23,65.2,57.2,75.2,999.9,,54.7,2.1,0.0,1
3,1562,2283,Bangladesh,2020-01-24,64.6,55.4,77.0,999.9,,54.6,3.4,0.0,1
4,1563,2284,Bangladesh,2020-01-25,65.4,55.4,75.2,999.9,,53.3,2.9,0.0,0
5,1564,2285,Bangladesh,2020-01-26,64.8,55.4,75.2,999.9,,53.5,2.9,0.0,0
6,1565,2286,Bangladesh,2020-01-27,64.2,55.4,73.4,999.9,,56.8,3.0,0.0,1
7,1566,2287,Bangladesh,2020-01-28,66.1,57.2,75.2,999.9,,57.0,0.6,0.0,1


In [65]:
df_bangladesh.loc[3]

index                   1562
Id                      2283
Country_Region    Bangladesh
Date              2020-01-24
temp                    64.6
min                     55.4
max                     77.0
stp                    999.9
slp                      NaN
dewp                    54.6
wdsp                     3.4
prcp                     0.0
fog                        1
Name: 3, dtype: object

In [66]:
df_bangladesh.iloc[3]

index                   1563
Id                      2284
Country_Region    Bangladesh
Date              2020-01-25
temp                    65.4
min                     55.4
max                     75.2
stp                    999.9
slp                      NaN
dewp                    53.3
wdsp                     2.9
prcp                     0.0
fog                        0
Name: 4, dtype: object
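In the solution above, `.reset_index()` keeps the old labels as an extra `index` column. If that column is not needed, passing `drop=True` discards it. The sketch below shows this on a small hypothetical stand-in for `df_bangladesh`.

```python
import pandas as pd

# Hypothetical stand-in for a filtered frame whose labels no longer start at 0.
subset = pd.DataFrame(
    {"temp": [66.2, 65.2, 64.6, 65.4]}, index=[1560, 1561, 1562, 1563]
)

renumbered = subset.reset_index(drop=True)  # drop=True discards the old labels
renumbered.index = renumbered.index + 1     # labels now run from 1 to 4

print(renumbered.columns.tolist())  # no extra "index" column
print(renumbered.loc[3, "temp"])    # label 3 is the third row
print(renumbered.iloc[3]["temp"])   # position 3 is the fourth row
```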

## **<font color='#306998'>TASK </font><font color='#ffd33b'>6</font>**

View all observations:
* from March 2020 (`df_bangladesh`)
* from February 1 onwards for each country

In [67]:
df_bangladesh[df_bangladesh["Date"] >= "2020-03-01"].sample(5)

Unnamed: 0,index,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog
64,1623,2344,Bangladesh,2020-03-25,81.5,69.8,91.9,7.1,1008.1,64.4,0.6,0.0,0
62,1621,2342,Bangladesh,2020-03-23,77.6,67.1,87.1,9.8,1010.8,63.7,0.5,0.0,0
47,1606,2327,Bangladesh,2020-03-08,75.2,66.2,86.9,10.1,1011.1,66.4,0.6,0.02,0
74,1633,2354,Bangladesh,2020-04-04,85.3,74.8,96.8,9.0,1009.9,70.1,1.8,0.0,0
66,1625,2346,Bangladesh,2020-03-27,85.5,74.1,97.9,7.8,1008.8,63.9,0.8,0.0,0


In [68]:
df_bangladesh["Date"] = df_bangladesh["Date"].astype("datetime64[ns]")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bangladesh["Date"] = df_bangladesh["Date"].astype("datetime64[ns]")


In [69]:
df_bangladesh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 1 to 78
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   index           78 non-null     int64         
 1   Id              78 non-null     int64         
 2   Country_Region  78 non-null     object        
 3   Date            78 non-null     datetime64[ns]
 4   temp            78 non-null     float64       
 5   min             78 non-null     float64       
 6   max             78 non-null     float64       
 7   stp             78 non-null     float64       
 8   slp             50 non-null     float64       
 9   dewp            78 non-null     float64       
 10  wdsp            78 non-null     float64       
 11  prcp            78 non-null     float64       
 12  fog             78 non-null     int64         
dtypes: datetime64[ns](1), float64(8), int64(3), object(1)
memory usage: 8.1+ KB


In [70]:
df_bangladesh["Year"] = df_bangladesh["Date"].dt.year

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bangladesh["Year"] = df_bangladesh["Date"].dt.year


In [71]:
df_bangladesh["Month"] = df_bangladesh["Date"].dt.month

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_bangladesh["Month"] = df_bangladesh["Date"].dt.month
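The warnings above appear because `df_bangladesh` was created by filtering `data`, so pandas cannot tell whether the assignment is meant to change the original table as well. One common way to avoid them, sketched here on a hypothetical miniature of the weather table, is to take an explicit `.copy()` of the subset before adding columns. The sketch also uses `pd.to_datetime()` as an alternative to `.astype("datetime64[ns]")`.

```python
import pandas as pd

# Hypothetical stand-in for the full weather table.
data = pd.DataFrame(
    {
        "Country_Region": ["Bangladesh", "Bangladesh", "Chad"],
        "Date": ["2020-02-28", "2020-03-01", "2020-03-01"],
    }
)

# .copy() makes the subset an independent frame, so column assignment is unambiguous.
bd = data[data["Country_Region"] == "Bangladesh"].copy()
bd["Date"] = pd.to_datetime(bd["Date"])
bd["Month"] = bd["Date"].dt.month

print(bd[bd["Date"] >= "2020-03-01"])  # observations from March 1 onwards
print("Month" in data.columns)         # the original table is untouched
```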


In [72]:
df_bangladesh.head()

Unnamed: 0,index,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog,Year,Month
1,1560,2281,Bangladesh,2020-01-22,66.2,57.2,75.2,999.9,,54.7,2.3,0.0,1,2020,1
2,1561,2282,Bangladesh,2020-01-23,65.2,57.2,75.2,999.9,,54.7,2.1,0.0,1,2020,1
3,1562,2283,Bangladesh,2020-01-24,64.6,55.4,77.0,999.9,,54.6,3.4,0.0,1,2020,1
4,1563,2284,Bangladesh,2020-01-25,65.4,55.4,75.2,999.9,,53.3,2.9,0.0,0,2020,1
5,1564,2285,Bangladesh,2020-01-26,64.8,55.4,75.2,999.9,,53.5,2.9,0.0,0,2020,1


In [73]:
df_bangladesh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 1 to 78
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   index           78 non-null     int64         
 1   Id              78 non-null     int64         
 2   Country_Region  78 non-null     object        
 3   Date            78 non-null     datetime64[ns]
 4   temp            78 non-null     float64       
 5   min             78 non-null     float64       
 6   max             78 non-null     float64       
 7   stp             78 non-null     float64       
 8   slp             50 non-null     float64       
 9   dewp            78 non-null     float64       
 10  wdsp            78 non-null     float64       
 11  prcp            78 non-null     float64       
 12  fog             78 non-null     int64         
 13  Year            78 non-null     int32         
 14  Month           78 non-null     int32         
dtypes: datet

In [74]:
df_bangladesh[(df_bangladesh["Year"] == 2020) & (df_bangladesh["Month"] >= 3)].head()

Unnamed: 0,index,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog,Year,Month
40,1599,2320,Bangladesh,2020-03-01,78.0,65.5,88.5,10.0,1010.9,59.0,1.5,0.0,0,2020,3
41,1600,2321,Bangladesh,2020-03-02,77.3,66.7,87.6,9.8,1010.8,62.2,1.1,0.0,0,2020,3
42,1601,2322,Bangladesh,2020-03-03,74.8,64.0,85.6,9.3,1010.2,59.5,2.8,0.54,1,2020,3
43,1602,2323,Bangladesh,2020-03-04,75.7,68.0,87.8,9.9,1010.9,61.9,2.5,0.48,1,2020,3
44,1603,2324,Bangladesh,2020-03-05,76.9,65.7,86.2,9.9,1010.9,63.9,1.2,0.0,0,2020,3


In [75]:
data["Date"] = data["Date"].astype("datetime64[ns]")

In [76]:
data["Month"] = data["Date"].dt.month

In [77]:
data[data['Month'] >= 2].head(10)

Unnamed: 0,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog,Month
10,11,Afghanistan,2020-02-01,5.6,4.8,7.7,774.4,1031.0,-0.6,1.9,0.0,1,2
11,12,Afghanistan,2020-02-02,20.5,3.9,36.0,773.4,1020.7,12.6,4.5,1.73,1,2
12,13,Afghanistan,2020-02-03,20.5,3.9,36.0,773.4,1020.7,12.6,4.5,1.73,1,2
13,14,Afghanistan,2020-02-04,4.2,-0.6,12.9,775.4,1033.3,-1.3,3.3,0.0,1,2
14,15,Afghanistan,2020-02-05,4.2,-0.6,12.9,775.4,1033.3,-1.3,3.3,0.0,1,2
15,16,Afghanistan,2020-02-06,43.1,29.1,58.1,999.9,1020.9,22.0,1.7,0.0,0,2
16,17,Afghanistan,2020-02-07,44.5,31.6,60.3,999.9,1022.4,18.4,5.7,0.0,0,2
17,18,Afghanistan,2020-02-08,47.8,34.5,63.5,999.9,1021.2,17.2,7.7,0.0,0,2
18,19,Afghanistan,2020-02-09,36.9,31.1,42.1,773.3,1011.0,29.4,4.3,0.08,1,2
19,20,Afghanistan,2020-02-10,36.9,31.1,42.1,773.3,1011.0,29.4,4.3,0.08,1,2
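
The date filtering used above can also be sketched on a small frame. This is only an illustration: the `toy` table and its values are made up, and `pd.to_datetime` is shown as a common alternative to `.astype("datetime64[ns]")`, not as the notebook's own approach.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on the weather table loaded earlier.
toy = pd.DataFrame({
    "Date": ["2020-01-30", "2020-02-01", "2020-03-05"],
    "temp": [41.0, 38.5, 52.3],
})

# Parse the strings into datetime64 values.
toy["Date"] = pd.to_datetime(toy["Date"])

# The .dt accessor exposes date parts such as the month number.
toy["Month"] = toy["Date"].dt.month

# Keep only the rows from February onwards.
from_february = toy[toy["Month"] >= 2]
print(from_february)
```

Once the column has a datetime dtype, any condition on `.dt.year`, `.dt.month` or `.dt.day` can be combined with `&` and `|` in the same way as the other boolean masks in this notebook.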


## **<font color='#306998'>TASK </font><font color='#ffd33b'>7</font>**

Check how many rows contain missing data in the `slp` column. Delete those rows and reset the row index (see the `.reset_index()` method).


In [78]:
data["slp"].isna().sum()

10105

In [79]:
data1 = data[data["slp"].notna()]

In [80]:
data1.shape[0]

14309

In [81]:
data1["slp"].isna().sum()

0

In [82]:
data1.reset_index(inplace=True)

In [83]:
data1.head()

Unnamed: 0,index,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog,Month
0,0,1,Afghanistan,2020-01-22,42.6,33.6,54.9,999.9,1024.3,27.4,9.4,0.0,0,1
1,1,2,Afghanistan,2020-01-23,42.0,32.7,55.9,999.9,1020.8,22.8,14.9,99.99,1,1
2,2,3,Afghanistan,2020-01-24,40.1,36.9,43.2,999.9,1018.6,34.5,10.4,0.17,1,1
3,3,4,Afghanistan,2020-01-25,46.0,37.9,56.3,999.9,1018.0,37.8,6.1,0.57,1,1
4,4,5,Afghanistan,2020-01-26,42.8,36.1,53.1,999.9,1014.8,33.2,10.8,0.0,1,1


In [84]:
data = data.dropna(subset=["slp"]).reset_index(drop=True)

In [85]:
data.shape[0]

14309

In [86]:
data["slp"].isna().sum()

0
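
The two solutions above differ in one detail: `data1.head()` shows an extra `index` column, while `data` does not. A minimal sketch of why, using a made-up `toy` frame (the names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing slp values.
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C", "D"],
    "slp": [1012.5, np.nan, 1009.8, np.nan],
})

# Number of rows with a missing slp value.
missing = toy["slp"].isna().sum()

# Without drop=True the old row labels are kept as a new "index" column.
kept_old = toy.dropna(subset=["slp"]).reset_index()

# With drop=True the old row labels are discarded.
kept_clean = toy.dropna(subset=["slp"]).reset_index(drop=True)

print(missing)
print(kept_old)
print(kept_clean)
```

Which variant to prefer depends on whether the original row labels are still needed later; if they are not, `drop=True` keeps the table tidy.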

## **<font color='#306998'>TASK </font><font color='#ffd33b'>8</font>**

Compute the average temperature over all `temp` entries and select the records where the temperature is below that average. How many are there?

In [87]:
avg = data["temp"].mean()
avg

55.46214270738696

In [88]:
below_avg = data[data["temp"] < avg]

In [89]:
below_avg.shape[0]

7218

In [90]:
below_avg.sample(10)

Unnamed: 0,Id,Country_Region,Date,temp,min,max,stp,slp,dewp,wdsp,prcp,fog,Month
6010,13819,France,2020-02-15,17.8,12.2,23.2,19.1,1020.0,8.7,24.7,0.0,0,2
8095,19871,Moldova,2020-02-25,42.6,26.1,56.1,0.5,1010.3,25.6,4.1,0.0,0,2
12253,30228,US,2020-02-08,27.0,17.1,45.0,949.6,1019.0,18.2,6.8,0.0,0,2
13296,32746,US,2020-02-18,28.5,19.0,30.9,885.5,1027.3,22.3,2.3,0.0,0,2
12019,29771,US,2020-02-07,20.4,17.1,26.1,969.9,1008.8,13.4,4.1,0.0,1,2
3281,6964,China,2020-01-31,44.5,34.2,50.0,879.7,1020.3,33.2,4.4,0.0,1,1
5137,10989,Denmark,2020-03-06,39.2,32.5,45.1,997.0,1003.4,34.6,6.9,0.02,0,3
9204,22837,Poland,2020-02-27,36.2,30.2,39.2,983.8,999.0,31.7,9.4,99.99,1,2
14005,34382,United Kingdom,2020-03-29,39.7,36.7,43.7,30.7,1037.3,27.6,14.6,0.02,0,3
11487,28644,US,2020-02-20,20.4,14.0,27.0,5.5,1038.7,8.8,11.0,0.0,0,2
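
The same steps can be sketched on a tiny made-up frame, where the result is easy to verify by hand. The `toy` data is hypothetical. The sketch also shows that summing a boolean mask counts the matching rows directly, and that passing `random_state` to `.sample()` makes the random preview repeatable between runs.

```python
import pandas as pd

# Hypothetical toy temperatures; their mean is 60.0.
toy = pd.DataFrame({"temp": [30.0, 50.0, 70.0, 90.0]})

avg_toy = toy["temp"].mean()

# Boolean mask: True where the temperature is below the average.
mask = toy["temp"] < avg_toy

# Selecting with the mask keeps only the matching rows.
below = toy[mask]

# True counts as 1, so summing the mask gives the number of matches.
n_below = int(mask.sum())

# A fixed random_state makes the sampled row the same on every run.
preview = below.sample(1, random_state=0)

print(avg_toy, n_below)
print(preview)
```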


## **<font color='#306998'>TASK </font><font color='#ffd33b'>9</font>**

Check whether there are any countries and days where the minimum temperature reported during the day (`min`) was higher than the average temperature computed in Task 8 for the whole table.

In [91]:
data[data["min"] > avg]["Country_Region"].unique()

array(['Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Australia', 'Bahamas', 'Bangladesh', 'Barbados', 'Belize',
       'Benin', 'Bolivia', 'Brazil', 'Burkina Faso', 'Burma',
       'Cabo Verde', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Costa Rica', "Cote d'Ivoire", 'Cuba', 'Djibouti', 'Dominica',
       'Egypt', 'Equatorial Guinea', 'Eritrea', 'Eswatini', 'Ethiopia',
       'Fiji', 'France', 'Gambia', 'Ghana', 'Grenada', 'Guinea',
       'Guinea-Bissau', 'Guyana', 'Haiti', 'India', 'Indonesia', 'Iraq',
       'Israel', 'Jamaica', 'Jordan', 'Kuwait', 'Laos', 'Liberia',
       'Libya', 'MS Zaandam', 'Malaysia', 'Maldives', 'Mozambique',
       'Netherlands', 'New Zealand', 'Niger', 'Nigeria', 'Oman', 'Panama',
       'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Qatar',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'Senegal', 'Seychelles',
       'Sierra Leone', 'Somalia', 'Spain', 'Suriname', '

In [92]:
data[(data["Country_Region"] == "Algeria") & (data["min"] > avg)]["min"].count()

41
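
Counting the matching days for one country at a time, as done for Algeria above, can be extended to all countries at once. A minimal sketch, assuming a made-up `toy` frame and a hypothetical `threshold` that stands in for `avg`:

```python
import pandas as pd

# Hypothetical toy data; threshold plays the role of the overall average.
toy = pd.DataFrame({
    "Country_Region": ["A", "A", "B", "B", "C"],
    "min": [70.0, 40.0, 65.0, 66.0, 30.0],
})
threshold = 60.0

# Rows where the daily minimum exceeds the threshold.
warm = toy[toy["min"] > threshold]

# value_counts() returns the number of matching rows per country.
days_per_country = warm["Country_Region"].value_counts()

print(days_per_country)
```

Countries with no matching day (here `C`) do not appear in the result at all.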

## **<font color='#306998'>TASK </font><font color='#ffd33b'>10</font>**

Check how many times `fog` has been observed and what percentage of all entries it represents.

In [93]:
foggy_days = (data["fog"] == 1).sum()
foggy_days

4776

In [94]:
entries = data.shape[0]
entries

14309

In [95]:
foggy_days / entries * 100

33.377594520930884
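
The count and the percentage can also be obtained from the same boolean mask, because the mean of a mask is the fraction of `True` values. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: fog observed on 2 of 5 days.
toy = pd.DataFrame({"fog": [1, 0, 0, 1, 0]})

# Mask of foggy entries.
is_foggy = toy["fog"] == 1

# Summing the mask counts the foggy entries.
foggy = int(is_foggy.sum())

# The mean of the mask is the share of foggy entries; times 100 gives percent.
share = is_foggy.mean() * 100

print(foggy, round(share, 2))
```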