# Why We Need Data & Importance of Data Collection

![Alt text](imgs/data_has_better_idea_slide_01.png)

# The Most Popular Data Processing Tools Today in DS/ML

![Alt text](imgs/data_has_better_idea_slide_02.png)

# Data Processing and Manipulation with NumPy, Pandas, Huggingface Datasets, and Working with APIs

## Objectives
- Understand the basics of NumPy and Pandas for data processing.
- Learn how to use Huggingface Datasets for data manipulation.
- Explore working with APIs for data acquisition.
- Apply knowledge through hands-on examples and exercises.



<!-- ![Alt text](imgs/finallpandas_.png) -->

# Introduction

Data processing and manipulation are crucial skills in data science and machine learning. In this lecture, we will cover the main functionality and common use-cases of NumPy, Pandas, Huggingface Datasets, and working with APIs.

# NumPy Basics
NumPy (Numerical Python) is a library that provides support for arrays, matrices, and many mathematical functions. It is widely used in data processing, scientific computing, and machine learning.

![Alt text](imgs/what-is-numpy.png)

In [3]:
# Installing the numpy library (uncomment the line below if not already installed)
# !pip install numpy

# Importing NumPy library
import numpy as np

### Creating Arrays
We can create arrays using various functions like `np.array()`, `np.zeros()`, `np.ones()`, `np.arange()`, and `np.linspace()`.


In [5]:
# Creating an array from a list
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)

Array: [1 2 3 4 5]


In [6]:
# Creating an array of zeros
zeros_arr = np.zeros((3, 3))
print("Zeros Array:\n", zeros_arr)


Zeros Array:
 [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [8]:
# Creating an array of ones
ones_arr = np.ones((2, 4))
print("Ones Array:\n", ones_arr)

Ones Array:
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]]


In [9]:
# Creating an array with a range of values
range_arr = np.arange(0, 10, 2)
print("Range Array:", range_arr)

Range Array: [0 2 4 6 8]


In [10]:
# Creating an array with linearly spaced values
linspace_arr = np.linspace(0, 1, 5)
print("Linspace Array:", linspace_arr)

Linspace Array: [0.   0.25 0.5  0.75 1.  ]


### Array Operations
NumPy allows us to perform element-wise operations on arrays, such as addition, subtraction, multiplication, and division.


In [13]:
# Element-wise operations
arr = np.array([1, 2, 3, 4, 5])
arr2 = arr * 2
print("Original Array:", arr)
print("Array after multiplication by 2:", arr2)

Original Array: [1 2 3 4 5]
Array after multiplication by 2: [ 2  4  6  8 10]


In [14]:
# Adding two arrays
arr3 = arr + arr2
print("Sum of Arrays:", arr3)

Sum of Arrays: [ 3  6  9 12 15]


In [15]:
# Other operations
print("Array after subtraction:", arr2 - arr)
print("Array after division:", arr2 / arr)

Array after subtraction: [1 2 3 4 5]
Array after division: [2. 2. 2. 2. 2.]


### Mathematical Functions
NumPy provides many mathematical functions for operations such as mean, standard deviation, sum, and dot product.


In [16]:
# Array for mathematical operations
arr = np.array([1, 2, 3, 4, 5])

# Mean
print("Mean:", np.mean(arr))

# Standard Deviation
print("Standard Deviation:", np.std(arr))

# Sum
print("Sum:", np.sum(arr))

# Dot Product
arr2 = np.array([6, 7, 8, 9, 10])
print("Dot Product:", np.dot(arr, arr2))

Mean: 3.0
Standard Deviation: 1.4142135623730951
Sum: 15
Dot Product: 130


### Broadcasting and Vectorization
NumPy's broadcasting and vectorization features allow us to perform operations on arrays of different shapes and sizes efficiently, without writing explicit Python loops.

In [17]:
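# Broadcasting (both arrays have the same shape here, so this is plain element-wise addition;
# broadcasting proper applies when the shapes differ, e.g. arr + 10)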
arr = np.array([1, 2, 3, 4, 5])
broadcast_arr = arr + np.array([10, 20, 30, 40, 50])
print("Broadcasting Example:", broadcast_arr)

# Vectorization
vectorized_sum = np.sum(arr)
print("Vectorized Sum:", vectorized_sum)

Broadcasting Example: [11 22 33 44 55]
Vectorized Sum: 15
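Broadcasting is easiest to see when the operands have different shapes. The sketch below is only an illustration: the array names and values (`col`, `row`, `grid`) are assumptions, not part of the lecture material.

```python
import numpy as np

# A (3, 1) column and a (4,) row have different shapes.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Broadcasting conceptually stretches both operands to a common (3, 4) shape.
grid = col + row
print(grid.shape)  # (3, 4)
print(grid)

# A scalar is broadcast against every element of the array.
shifted = row + 100
print(shifted)  # [101 102 103 104]

# Vectorization: one array expression replaces an explicit Python loop.
squares_loop = np.array([x ** 2 for x in row])
squares_vec = row ** 2
print(np.array_equal(squares_loop, squares_vec))  # True
```

Each row of `grid` is `row` shifted by the matching entry of `col`, which is the kind of shape-mismatched operation that broadcasting handles.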


### Summary
In this notebook, we covered:
- Creating arrays using different functions.
- Performing basic array operations.
- Indexing, slicing, and reshaping arrays.
- Using mathematical functions.
- Understanding broadcasting and vectorization.

NumPy is a powerful library that forms the foundation for many data processing and scientific computing tasks. With these basics, you can start leveraging NumPy for your projects.
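The summary lists indexing, slicing, and reshaping. A minimal sketch of those operations, assuming a small illustrative array (the variable names are not from the lecture), could look like this:

```python
import numpy as np

arr = np.arange(12)            # values 0 through 11

# Indexing: a single element by position (negative indices count from the end).
first = arr[0]
last = arr[-1]

# Slicing: start:stop:step, where stop is exclusive.
middle = arr[2:5]              # elements at positions 2, 3, 4
every_third = arr[::3]         # every third element

# Reshaping: view the same 12 values as a 3x4 matrix.
matrix = arr.reshape(3, 4)
second_row = matrix[1]         # row at index 1
third_col = matrix[:, 2]       # column at index 2
element = matrix[1, 2]         # row 1, column 2

# Boolean indexing: keep only the elements where the mask is True.
evens = arr[arr % 2 == 0]

print(middle, every_third)
print(matrix)
print(second_row, third_col, element)
print(evens)
```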




## Introduction to Pandas
Pandas is a library that provides data structures and data analysis tools for Python. It is widely used for data cleaning, preparation, and analysis.


![Alt text](imgs/__.png)

In [25]:
# Installing the pandas library (uncomment the line below if not already installed)
# !pip install pandas

# Importing Pandas library
import pandas as pd

### Creating DataFrames
We can create DataFrames using various methods like `pd.DataFrame()`, `pd.read_csv()`, and `pd.read_excel()`.


In [26]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("DataFrame:\n", df)

# Creating a DataFrame from a CSV file
# Uncomment the line below if you have a CSV file to read
# df_csv = pd.read_csv('data.csv')
# print("DataFrame from CSV:\n", df_csv)

DataFrame:
       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


### Data Selection and Filtering
We can select columns with bracket indexing, filter rows with boolean conditions, and select by label or position with `.loc[]` and `.iloc[]`.


In [22]:
# Selecting a single column
print("Name Column:\n", df['Name'])

# Selecting multiple columns
print("\nName and Age Columns:\n", df[['Name', 'Age']])

# Filtering rows based on a condition
filtered_df = df[df['Age'] > 30]
print("\nFiltered DataFrame (Age > 30):\n", filtered_df)
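The `.loc[]` and `.iloc[]` accessors can be sketched on the same kind of DataFrame. The snippet rebuilds the small example table so it runs on its own; the variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# .iloc selects by integer position: the first row, then the first two rows.
first_row = df.iloc[0]
first_two = df.iloc[0:2]

# .loc selects by index label and column name.
bob_age = df.loc[1, 'Age']

# .loc also accepts a boolean mask together with a list of columns.
older_names = df.loc[df['Age'] > 25, ['Name']]

print(first_row)
print(first_two)
print(bob_age)
print(older_names)
```

With the default integer index, labels and positions coincide, so the difference between the two accessors only becomes visible once the index is changed (for example after filtering or sorting).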

### Data Manipulation
We can add or remove columns, handle missing values, and perform various other data manipulation tasks.


In [24]:
# Adding a new column
df['City'] = ['New York', 'Los Angeles', 'Chicago']
print("DataFrame with New Column:\n", df)

# Removing a column
df.drop('City', axis=1, inplace=True)
print("\nDataFrame after Removing Column:\n", df)

# Handling missing values
df_with_nan = df.copy()
df_with_nan.loc[1, 'Age'] = None
print("\nDataFrame with Missing Value:\n", df_with_nan)

# Filling missing values
df_filled = df_with_nan.fillna(0)
print("\nDataFrame after Filling Missing Values:\n", df_filled)

# Dropping rows with missing values
df_dropped = df_with_nan.dropna()
print("\nDataFrame after Dropping Missing Values:\n", df_dropped)

### Grouping and Aggregation
We can group data and perform aggregation operations using `.groupby()` and `.agg()`.


In [None]:
# Grouping data
grouped_df = df.groupby('Age').count()
print("Grouped DataFrame:\n", grouped_df)

# Aggregation (the grouping column 'Age' becomes the index,
# so only the remaining column 'Name' can be aggregated)
agg_df = df.groupby('Age').agg({'Name': 'count'})
print("\nAggregated DataFrame:\n", agg_df)
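Grouping is more informative when groups hold more than one row. The following sketch uses a hypothetical `people` table (invented for illustration) and pandas named aggregation to compute several statistics per group.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
people = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago', 'Chicago'],
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'Age': [25, 30, 35, 40, 20],
})

# Named aggregation: output_column=(input_column, aggregation_function).
summary = people.groupby('City').agg(
    people_count=('Name', 'count'),
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
)
print(summary)
```

The grouping column (`City`) becomes the index of the result, and each keyword argument names one output column.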

### Merging and Joining Data
We can merge and join data using `pd.merge()` and `pd.concat()`.


In [None]:
# Merging DataFrames
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})
merged_df = pd.merge(df1, df2, on='ID')
print("Merged DataFrame:\n", merged_df)

# Concatenating DataFrames
df3 = pd.DataFrame({'ID': [3], 'Name': ['Charlie'], 'Age': [35]})
concat_df = pd.concat([merged_df, df3])
print("\nConcatenated DataFrame:\n", concat_df)
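When the key columns do not match exactly, the `how` argument of `pd.merge()` decides which rows survive, and `ignore_index=True` in `pd.concat()` rebuilds a clean index. A small sketch, with made-up tables whose IDs only partly overlap:

```python
import pandas as pd

# Illustrative tables: ID 3 exists only on the left, ID 4 only on the right.
left_df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right_df = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 40]})

inner = pd.merge(left_df, right_df, on='ID')               # only IDs in both tables
left = pd.merge(left_df, right_df, on='ID', how='left')    # all IDs from left_df
outer = pd.merge(left_df, right_df, on='ID', how='outer')  # IDs from either table

# Without ignore_index=True the original row labels would repeat (0, 1, 2, 0, 1, 2).
stacked = pd.concat([left_df, left_df], ignore_index=True)

print(inner)
print(left)
print(outer)
print(stacked)
```

Rows without a match on the other side get `NaN` in the missing columns, as with Charlie's `Age` in the left join.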

### Summary
In this notebook, we covered:
- Creating DataFrames using different methods.
- Selecting and filtering data.
- Performing data manipulation tasks.
- Grouping and aggregating data.
- Merging and joining DataFrames.

Pandas is a versatile library that provides powerful data manipulation and analysis tools. With these basics, you can start leveraging Pandas for your data projects.


# Huggingface Datasets Basics

In this notebook, we will explore the basics of the Huggingface Datasets library, which provides a standardized interface for accessing and processing large datasets used in Natural Language Processing (NLP).

## Introduction to Huggingface Datasets
The Huggingface Datasets library is a powerful tool for loading, processing, and managing datasets, especially in the context of NLP. It provides a simple and efficient way to handle large-scale datasets.


In [27]:
# Installing the datasets library (uncomment the line below if not already installed)
# !pip install datasets

from datasets import load_dataset


## Loading Datasets
We can load datasets using the `load_dataset` function. The library includes many pre-configured datasets.


In [5]:
# Loading the IMDB dataset
dataset = load_dataset('imdb')
print("IMDB Dataset:\n", dataset)

IMDB Dataset:
 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [7]:
print(type(dataset))
print(type(dataset['train']))

<class 'datasets.dataset_dict.DatasetDict'>
<class 'datasets.arrow_dataset.Dataset'>


## Exploring the Dataset
We can explore the dataset to understand its structure, including the features and examples.


In [None]:
# Exploring the structure of the dataset
print("Dataset Features:\n", dataset['train'].features)

# Viewing an example from the training set
print("\nExample from the Training Set:\n", dataset['train'][0])

## Splitting the Dataset
Datasets can be split into training, validation, and test sets.


In [None]:
# Splitting the dataset into training and test sets
train_test_split = dataset['train'].train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

print("Training Set Size:", len(train_dataset))
print("Test Set Size:", len(test_dataset))

## Selecting and Filtering Data
We can select specific subsets of the data or filter based on certain conditions.


In [None]:
# Selecting the first 5 examples from the training set
subset = train_dataset.select([0, 1, 2, 3, 4])
print("Subset of Training Set:\n", subset)

In [None]:
def filter_func(example: dict) -> bool:
    # Keep only the examples whose label is 1
    return example['label'] == 1

# Filtering examples with a specific condition
filtered_dataset = train_dataset.filter(filter_func)
print("\nFiltered Dataset (label == 1):\n", filtered_dataset)

## Applying Transformations
We can apply transformations to the dataset, such as tokenization or data augmentation.


In [None]:
def split_text_into_words(example: dict) -> dict:
    example['words'] = example['text'].split(' ')
    return example

# Apply the split_text_into_words function to every example in the dataset
processed_dataset = train_dataset.map(split_text_into_words)
print("Tokenized Dataset Example:\n", processed_dataset[0])

## Multiprocessing
We can leverage multiprocessing to speed up data processing tasks.


In [None]:
# Counting the CPU cores available for multiprocessing
# (this value can be passed to `map` through its `num_proc` argument)
import multiprocessing

# Number of processes
num_proc = multiprocessing.cpu_count()
print(f'{num_proc} processes can run simultaneously on this computer')

## Convert to Pandas

Pandas DataFrame and Huggingface Dataset are among the most popular data libraries, so it is easy to move data from one format to the other.

In [10]:
from datasets import Dataset, load_dataset

# Load the data into the Dataset class
dataset = load_dataset('imdb', split='train')
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [11]:
# Huggingface Dataset -> Pandas DataFrame
df = dataset.to_pandas()
df.head()

|   | text | label |
|---|------|-------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


In [12]:
# Pandas DataFrame -> Huggingface Dataset
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

## Summary
In this notebook, we covered:
- Loading datasets using Huggingface Datasets.
- Exploring dataset features and examples.
- Splitting the dataset into training and test sets.
- Selecting and filtering data based on specific conditions.
- Applying transformations using the `map` function.
- Using multiprocessing to speed up data processing tasks.
- Converting between Huggingface Datasets and Pandas DataFrames.

The Huggingface Datasets library provides a streamlined and efficient workflow for handling large-scale datasets. With these basics, you can start leveraging this powerful tool for your data processing needs.


## Introduction to APIs
APIs provide a way for different applications to communicate with each other. We will use the `requests` library to send HTTP requests to APIs and handle their responses.

## Importing Libraries
First, we need to import the necessary libraries.


In [39]:
# Importing the requests library
import requests
from pprint import PrettyPrinter
pp = PrettyPrinter()

## Making a GET Request
A GET request is used to retrieve data from a server.


In [67]:
# Making a GET request to a sample API
response = requests.get('https://api.coindesk.com/v1/bpi/currentprice/BTC.json')

# Checking the status code of the response
print("Status Code:", response.status_code)

# Parsing the JSON response
data = response.json()
print("Response JSON:")
pp.pprint(data)

Status Code: 200
Response JSON:
{'bpi': {'BTC': {'code': 'BTC',
                 'description': 'Bitcoin',
                 'rate': '1.0000',
                 'rate_float': 1},
         'USD': {'code': 'USD',
                 'description': 'United States Dollar',
                 'rate': '70,851.342',
                 'rate_float': 70851.3421}},
 'disclaimer': 'This data was produced from the CoinDesk Bitcoin Price Index '
               '(USD). Non-USD currency data converted using hourly conversion '
               'rate from openexchangerates.org',
 'time': {'updated': 'Jun 5, 2024 19:28:27 UTC',
          'updatedISO': '2024-06-05T19:28:27+00:00',
          'updateduk': 'Jun 5, 2024 at 20:28 BST'}}
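In practice a GET request often carries query parameters and should guard against slow or failing servers. The sketch below is an assumption-laden illustration: `fetch_json` is a hypothetical helper, and the `currency` and `limit` parameters are made up. The request is only prepared, not sent, so the example runs without network access.

```python
from typing import Optional

import requests

def fetch_json(url: str, params: Optional[dict] = None, timeout: float = 10.0):
    """Hypothetical helper: GET a URL and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.json()

# Preparing a request without sending it shows how params end up in the URL.
prepared = requests.Request(
    'GET', 'https://httpbin.org/get', params={'currency': 'USD', 'limit': 5}
).prepare()
print(prepared.method)
print(prepared.url)
```

Calling `fetch_json('https://httpbin.org/get', params={'currency': 'USD'})` would send the request for real. The `timeout` keeps the call from hanging indefinitely, and `raise_for_status()` turns HTTP error codes into exceptions instead of silently returning an error page.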


## Making a POST Request
A POST request is used to send data to a server.

In [64]:
import requests

# Form data to send in the request body
data = {
    'key': 'value'
}

# The API endpoint to communicate with
url_post = "https://httpbin.org/post"

# A POST request to the API
post_response = requests.post(url_post, data=data)

# Checking the status code of the response
print("Status Code:", post_response.status_code)

print(post_response.text)

Status Code: 200
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key": "value"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "9", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-6660bc46-58ccee3d32f29e0a781487a3"
  }, 
  "json": null, 
  "origin": "134.19.243.33", 
  "url": "https://httpbin.org/post"
}
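The response above shows the payload under `"form"` with a form-encoded content type, because it was passed through the `data` argument. Many APIs expect a JSON body instead, which `requests` builds when the payload is passed as `json`. The sketch below compares the two by preparing the requests without sending them; the variable names are illustrative.

```python
import json

import requests

payload = {'key': 'value'}
url = 'https://httpbin.org/post'

# data= encodes the dictionary as an HTML form body.
form_request = requests.Request('POST', url, data=payload).prepare()

# json= serializes the dictionary to JSON and sets the matching header.
json_request = requests.Request('POST', url, json=payload).prepare()

print(form_request.headers['Content-Type'])
print(form_request.body)
print(json_request.headers['Content-Type'])
print(json.loads(json_request.body))
```

Which form to use depends on what the target API documents. Sending the wrong encoding is a common cause of `400`-class errors.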

