# Data Wrangling Exercises


## Introduction

Data wrangling is the process of cleaning, transforming, and organizing data to make it more suitable for analysis. It is a critical step in any data analysis project, because it helps ensure that the data is accurate, consistent, and complete.

These exercises provide practice in data wrangling on a real-world dataset: the Slovenian Natural Language Inference dataset (SI-NLI), which contains premise-hypothesis text pairs labeled as entailment, contradiction, or neutral.

The exercises cover a range of data wrangling techniques, including importing data, performing basic statistics, subsetting observations and variables, creating new variables, grouping data, and combining datasets.

## Get data

1. Download SI-NLI from the [CLARIN.SI repository](https://www.clarin.si/repository/xmlui/handle/11356/1707).
2. Load the required libraries.
3. Import the ```train.tsv``` file.

In [3]:
# import required libraries
import numpy
import pandas

In [5]:
# import train.tsv file
train = pandas.read_csv('SI-NLI/train.tsv', sep='\t')

In [9]:
# print the first 5 rows of the train.tsv file
print(train.head())

  pair_id                                            premise  \
0      P0  Vendar se je anglikanska večina v grofijah na ...   
1      P1  INŠTRUKTOR IZ PRTLJAŽNIKA V DRUGO POTOVALKO PR...   
2      P2  biotska raznovrstnost – v splošnem je to razno...   
3      P3  Preroški pomen: Če v sanjah bedite, je to na s...   
4      P4  Jeseni so dnevi krajši, stemni se že dokaj zgo...   

                                          hypothesis annotation_1  \
0  A na glasovanju o priključitvi ozemlja k Sever...            E   
1  Učitelj je vzel iz prtljažnika iz prve potoval...            C   
2  Četudi je biodiverziteta pomemben del biološke...            C   
3  V preroškem smislu budnost v sanjah nakazuje o...            E   
4  V krajših jesenskih dneh tema nastopi relativn...            E   

                comment_1 annotator1_id annotation_2  \
0                     NaN   annotator_C            E   
1                     NaN   annotator_B            N   
2                     NaN   anno

## Basic statistics

1. How many examples are in the dataframe?
2. How many variables are in the dataframe?
3. Count values in the ```label``` column.
4. Are there any missing values in the data?
5. Count the number of missing values per column.

In [None]:
# How many examples are in the dataframe?
print("Number of examples in the dataframe: ", len(train))

# How many variables are in the dataframe?
print("Number of variables in the dataframe: ", len(train.columns))

# Count values in the label column
print("Count of values in the label column: ", train['label'].value_counts())

# are there any missing values in the dataframe?
print("Are there any missing values in the dataframe: ", train.isnull().values.any())

# count the number of missing values per column
print("Number of missing values per column: ", train.isnull().sum())
# train.isnull() returns a dataframe with True/False values, and sum() counts the number of True values in each column

Number of examples in the dataframe:  4392
Number of variables in the dataframe:  14
Count of values in the label column:  label
entailment       1518
contradiction    1448
neutral          1426
Name: count, dtype: int64
Are there any missing values in the dataframe:  True
Number of missing values per column:  pair_id                0
premise                0
hypothesis             0
annotation_1          91
comment_1           4349
annotator1_id         91
annotation_2          91
comment_2           4274
annotator2_id         91
annotation_3        3981
comment_3           4384
annotator3_id       3976
annotation_FINAL     508
label                  0
dtype: int64


## Subset observations and variables

1. Select the ```premise``` column and store it in a list.
2. Print the first 3 rows of the first 3 columns.
3. Select the ```pair_id```, ```premise```, ```hypothesis```, and ```label``` columns and save them in the ```train_dataset``` variable.
4. Drop the ```pair_id``` column.
5. Convert all column names to uppercase.
6. Replace ```_``` with ```-``` in column names.
7. Select rows that have the ```neutral``` label.
8. Select the last 30 rows.
9. Select rows with a ```hypothesis``` longer than 100 characters.
10. Select rows that have a ```hypothesis``` longer than 100 characters and the ```neutral``` label.
11. Select the row with the longest ```hypothesis```.
12. Remove rows that contain ```č```, ```š```, or ```ž``` in ```premise``` or ```hypothesis```.
13. Remove rows that contain at least one missing value.
14. Remove the column with the most missing values.

In [24]:
# select premise column and store it in a list
premise = train['premise'].tolist()

# print first 3 rows from the first 3 columns
print(train.iloc[:3, :3])


# select pair_id, premise, hypothesis and label columns and save them into train_dataset variable
train_dataset = train[['pair_id', 'premise', 'hypothesis', 'label']]

# drop pair_id column
train_dataset = train_dataset.drop(columns=['pair_id'])

# convert all column names to uppercase
train_dataset.columns = [i.upper() for i in train_dataset.columns]

# replace _ with - in column names
train_dataset.columns = train_dataset.columns.str.replace('_', '-')
print("\n", train_dataset.columns)

  pair_id                                            premise  \
0      P0  Vendar se je anglikanska večina v grofijah na ...   
1      P1  INŠTRUKTOR IZ PRTLJAŽNIKA V DRUGO POTOVALKO PR...   
2      P2  biotska raznovrstnost – v splošnem je to razno...   

                                          hypothesis  
0  A na glasovanju o priključitvi ozemlja k Sever...  
1  Učitelj je vzel iz prtljažnika iz prve potoval...  
2  Četudi je biodiverziteta pomemben del biološke...  

 Index(['PREMISE', 'HYPOTHESIS', 'LABEL'], dtype='object')
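
The cell above covers tasks 1 to 6. Tasks 7 to 14 could be approached along the following lines. This is only a sketch: `toy` is a hypothetical four-row frame with made-up sentences that borrows the column names of `train.tsv`, so that the snippet runs without the dataset, and treating uppercase `Č`, `Š`, `Ž` the same as lowercase in task 12 is an assumption, not something the task states.

```python
import pandas

# Hypothetical toy frame (made-up rows, NOT taken from SI-NLI) that reuses
# the column names of train.tsv; replace `toy` with `train` for the real data.
toy = pandas.DataFrame({
    "pair_id": ["P0", "P1", "P2", "P3"],
    "premise": ["Danes je lep dan.", "Pes spi na kavču.",
                "Sonce sije.", "Voda je mokra."],
    "hypothesis": ["Vreme je lepo.", "Pes ne spi. " * 10,
                   "Dežuje. " * 20, "Voda ni suha."],
    "comment_1": [None, None, None, "ok"],
    "label": ["entailment", "neutral", "contradiction", "neutral"],
})

# 7. rows that have the neutral label
neutral = toy[toy["label"] == "neutral"]

# 8. last 30 rows (tail returns every row if the frame is shorter)
last_rows = toy.tail(30)

# 9. rows with a hypothesis longer than 100 characters
hyp_len = toy["hypothesis"].str.len()
long_hyp = toy[hyp_len > 100]

# 10. long hypothesis AND neutral label (parentheses are required around each mask)
long_neutral = toy[(hyp_len > 100) & (toy["label"] == "neutral")]

# 11. the row with the longest hypothesis
longest = toy.loc[[hyp_len.idxmax()]]

# 12. drop rows containing č, š or ž in premise or hypothesis
#     (assumption: case=False so that Č, Š, Ž are matched as well)
pattern = "[čšž]"
has_special = (toy["premise"].str.contains(pattern, case=False)
               | toy["hypothesis"].str.contains(pattern, case=False))
without_special = toy[~has_special]

# 13. drop rows with at least one missing value
complete_rows = toy.dropna()

# 14. drop the column with the most missing values
most_missing = toy.isnull().sum().idxmax()
fewer_columns = toy.drop(columns=[most_missing])
```

Every selection returns a new dataframe and leaves `toy` unchanged, so the steps can be run in any order.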


## Create new variables

1. Create an integer variable ```vowel_count_premise``` that stores the number of vowels in ```premise```. Repeat for ```hypothesis```.
2. Create an integer variable with possible values ```1```, ```2```, ```3``` that counts how many annotations each example received.
3. Create a boolean variable ```agreement``` that indicates whether all annotators agreed on the label.
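
One possible approach to the three tasks above is sketched here. Several things are assumptions rather than part of the exercise: `toy` is a hypothetical two-row frame with made-up sentences and annotations, the names `vowel_count_hypothesis` and `annotation_count` are illustrative, vowels are taken to be `a`, `e`, `i`, `o`, `u` in either case, and agreement is computed only over the annotations that are present.

```python
import pandas

# Hypothetical toy frame (made-up rows, NOT taken from SI-NLI) that reuses
# the column names of train.tsv; replace `toy` with `train` for the real data.
toy = pandas.DataFrame({
    "premise": ["Danes je lep dan.", "Pes spi."],
    "hypothesis": ["Vreme je lepo.", "Pes teče."],
    "annotation_1": ["E", "C"],
    "annotation_2": ["E", "N"],
    "annotation_3": [None, "C"],
})

# 1. number of vowels per text (assumption: a, e, i, o, u in either case)
vowels = "[aeiouAEIOU]"
toy["vowel_count_premise"] = toy["premise"].str.count(vowels).astype(int)
toy["vowel_count_hypothesis"] = toy["hypothesis"].str.count(vowels).astype(int)

# 2. number of annotations per example = non-missing annotation cells in the row
annotation_cols = ["annotation_1", "annotation_2", "annotation_3"]
toy["annotation_count"] = toy[annotation_cols].notna().sum(axis=1).astype(int)

# 3. agreement: exactly one distinct label among the annotations given
#    (nunique skips missing values by default)
toy["agreement"] = toy[annotation_cols].nunique(axis=1) == 1
```

On the real data, the missing-value counts in the Basic statistics section show 91 rows without `annotation_1`, so `annotation_count` may also take the value `0` unless those rows are handled separately.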

## Save dataframes

1. Save the original dataset to disk in ```csv``` format.
...
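
Saving to CSV could look roughly like the sketch below. The frame `toy`, the temporary directory, and the file name `train.csv` are all hypothetical stand-ins chosen so that the snippet runs on its own; with the real data, `train.to_csv(...)` would be called with whatever path suits the project.

```python
import os
import tempfile

import pandas

# Hypothetical one-row frame (made-up content) standing in for `train`.
toy = pandas.DataFrame({
    "pair_id": ["P0"],
    "premise": ["Danes je lep dan."],
    "label": ["entailment"],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "train.csv")

# index=False leaves out the row index so it does not come back as an extra column;
# UTF-8 keeps characters such as č, š, ž intact.
toy.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to check the round trip.
restored = pandas.read_csv(out_path)
```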