# Lecture 2 - DSAI510 Data Science

# Data Types in Python

Understanding data types is crucial in data science for efficient data processing, analysis, and modeling. Data in Python can be classified into:
- Numeric (int or float)
- Categorical
- DateTime
- String (text)
- Boolean (True or False)

Numbers are not always numeric data; in some contexts they are categorical. Age, for example, is numeric because it measures a quantity with a natural order (25 years is older than 24 years). In contrast, the numbers 1, 2, 3, ... written on boxes to label them say nothing about quantity or order in the boxes' contents. They act as identifiers, or names, for the boxes, so they are categorical. The key distinction is the context in which the numbers are used.
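
The distinction can be sketched in pandas. The table below is a made-up toy example (the column names `owner_age` and `box_label` are assumptions for illustration): the age column is left numeric, while the box labels are stored as a categorical column.

```python
import pandas as pd

# Hypothetical toy data: ages are quantities, box labels are just names.
boxes = pd.DataFrame({
    "owner_age": [24, 25, 31],
    "box_label": [1, 2, 3],
})

# Averaging ages is meaningful because age is a numeric quantity.
mean_age = boxes["owner_age"].mean()

# Box labels are identifiers, so we store them as a categorical column.
boxes["box_label"] = boxes["box_label"].astype("category")

print(mean_age)
print(boxes.dtypes)
```

After the conversion, pandas treats `box_label` as a set of names rather than as quantities.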

# Data Structures in Python

Data structures are containers that hold multiple data items, possibly of diverse types. 
- List
- Dictionary
- Some others that we won't use much
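
As a small sketch of "possibly of diverse types", the list and dictionary below each mix several data types. The variable names and values are made up for illustration.

```python
# A list keeps items in order and can mix types.
mixed_list = [42, 3.14, "text", True]

# A dictionary maps keys to values; the values can also differ in type.
record = {"name": "Ali", "age": 30, "scores": [90, 85]}

type_names = [type(item).__name__ for item in mixed_list]
first_score = record["scores"][0]

print(type_names)
print(first_score)
```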

In [None]:
# Integers and Floats
int_value = 5
float_value = 5.5

print(f"Integer value: {int_value}")
print(f"Float value: {float_value}")

In [None]:
# Strings
string_value = "Data Science"
print(string_value)
print(string_value.upper())
print(string_value.lower())

In [None]:
# Boolean
cat_is_animal = True
print(cat_is_animal)

In [None]:
print(type(int_value))
print(type(float_value))
print(type(string_value))
print(type(cat_is_animal))

In [None]:
# Categorical Data
import pandas as pd

# Sample data where the category column is string (not real category yet)
data = {'Category': ['A', 'B', 'A', 'C'], 'Size': [1,1,5,10]}
df = pd.DataFrame(data)
df

In [None]:
print(df['Category'].dtype)
print(df['Size'].dtype)

# Category column is not really categorical yet.

In [None]:
# Convert the Category column to the categorical type (this will be needed for machine learning)
df['Category'] = df['Category'].astype('category')
print(df['Category'].dtype)
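
To see what the conversion changes, here is a minimal sketch on a separate toy frame (`df_cat` is a name introduced only for this example). A categorical column keeps the distinct labels once and stores an integer code per row; both can be inspected through the `.cat` accessor.

```python
import pandas as pd

df_cat = pd.DataFrame({"Category": ["A", "B", "A", "C"]})
df_cat["Category"] = df_cat["Category"].astype("category")

# The distinct labels, stored once
categories = list(df_cat["Category"].cat.categories)

# One integer code per row, pointing into the list of labels
codes = df_cat["Category"].cat.codes.tolist()

print(categories)
print(codes)
```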

In [None]:
# DateTime operations with pandas
date_series = pd.to_datetime(pd.Series(['2023-09-15', '2023-02-10', '2023-03-05']), format='%Y-%m-%d')
print(date_series.dt.month)
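
Once a column has a datetime type, other `.dt` attributes and date arithmetic become available. The sketch below reuses the same three dates; the variable names are chosen only for this example.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2023-09-15", "2023-02-10", "2023-03-05"]),
    format="%Y-%m-%d",
)

# Name of the weekday for each date
weekdays = dates.dt.day_name().tolist()

# Number of days elapsed since the earliest date in the series
days_since_first = (dates - dates.min()).dt.days.tolist()

print(weekdays)
print(days_since_first)
```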

In [None]:
list_example = [1, 2, 3, 4, 5]
type(list_example)

In [None]:
dict_example = {'Ali': 123, 'Veli': 999, 'Zeki': 444}
print(dict_example)
type(dict_example)

In [None]:
dict_example['Zeki']

In [None]:
import pandas as pd

df_example = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
print(df_example)

# Python datasets

## PyDataset

In [None]:
# The pydataset package is not a standard package that ships with Python, so we need to install it manually.
# The "!" below sends a command to your operating system (macOS, Linux or Windows) from within the notebook.
!pip install pydataset

In [None]:
# Import package
from pydataset import data
# List available datasets
data()

In [None]:
import pandas as pd
pd.set_option('display.show_dimensions', True)
# Set the display option to show all rows
pd.set_option('display.max_rows', None)

# List available datasets
display(data())

In [None]:
# Turn off "show all rows" as it will make your notebook too crowded for large datasets later
pd.reset_option('display.max_rows')

display(data())



In [None]:
# Load as a dataframe
df = data('Titanic')

df

In [None]:
# Let's write a function called glimpse that prints the shape of a dataframe and shows its first and last rows.
def glimpse(df):
    print(f"{df.shape[0]} rows and {df.shape[1]} columns")
    display(df.head())
    display(df.tail())
    

df = data("Titanic")
glimpse(df)

## Seaborn

In [None]:
import seaborn as sns
print(sns.get_dataset_names()) 

In [None]:
# Load as a dataframe
df = sns.load_dataset('tips')

# total_bill is the bill for the table. size is the party size, i.e. how many people sat at the table.
glimpse(df)

## Scikit-learn

In [None]:
from sklearn import datasets

# List attributes with Python command dir()
dir(datasets)

# Below, functions starting with load_ typically provide small 
# standard datasets that come bundled with the library
# while functions starting with fetch_ will download 
# larger datasets from the internet the first time you call them.
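
Rather than scanning the full `dir()` output by eye, the names can be split by prefix. This is a small sketch; the exact lists depend on the installed scikit-learn version.

```python
from sklearn import datasets

# Names of the bundled-dataset loaders and of the downloaders
loaders = [name for name in dir(datasets) if name.startswith("load_")]
fetchers = [name for name in dir(datasets) if name.startswith("fetch_")]

print(loaders)
print(fetchers)
```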

In [None]:
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
diabetesdata = load_diabetes()

print(diabetesdata)

In [None]:
diabetesdata['data']  # or diabetesdata.data

In [None]:
# Convert the dataset to a DataFrame
df = pd.DataFrame(diabetesdata.data, columns=diabetesdata.feature_names)
df

# The target variable, which quantifies the disease progression, is stored in diabetesdata.target.
# Let's add it to the dataframe.

In [None]:
# Add the target variable to the DataFrame. This creates a new column named "target".
df['target'] = diabetesdata.target

# Display the first few rows of the DataFrame
df.head()
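
As an alternative sketch, recent scikit-learn versions accept `as_frame=True`, in which case the returned object carries a ready-made DataFrame in its `frame` attribute with the features and the target together. The name `df_diabetes` is used here only to avoid overwriting `df`.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks scikit-learn to return pandas objects
diabetes = load_diabetes(as_frame=True)

# .frame holds the feature columns plus a "target" column
df_diabetes = diabetes.frame

print(df_diabetes.shape)
print(df_diabetes.columns.tolist())
```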

## Natural Language Toolkit | NLTK

A list of available datasets is given at
https://www.nltk.org/nltk_data/

In [None]:
import nltk
nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

In [None]:
movie_reviews

# CategorizedPlaintextCorpusReader is a class that provides methods to access and work with text corpora that are organized into categories.
# It allows you to handle and explore text data easily, especially when the data is categorized (like the movie reviews dataset,
# which is divided into 'pos' (positive) and 'neg' (negative) categories).

In [None]:
# Get the list of file ids for positive and negative reviews
positive_fileids = movie_reviews.fileids(categories='pos')
negative_fileids = movie_reviews.fileids(categories='neg')

# Sample: Display the content of the first positive review and its label
sample_positive_review = " ".join(movie_reviews.words(fileids=positive_fileids[0]))
print("Review (Positive):\n", sample_positive_review[:500], "...")  
# Displaying the first 500 characters for brevity

# Sample: Display the content of one negative review (index 5 in the list) and its label
sample_negative_review = " ".join(movie_reviews.words(fileids=negative_fileids[5]))
print("\nReview (Negative):\n", sample_negative_review[:500], "...")  
# Displaying the first 500 characters for brevity


## Statsmodels

In [None]:
import statsmodels.api as sm

# Load the 'fair' dataset as an example
fair = sm.datasets.fair.load_pandas().data
print(fair.head())

occupation: The occupation of the respondent, coded from 1 to 6 in the statsmodels documentation of this dataset:
- 1 = Student
- 2 = Farming, agriculture; semi-skilled or unskilled worker
- 3 = White-collar
- 4 = Teacher, counselor, social worker, nurse; artist, writer; technician, skilled worker
- 5 = Managerial, administrative, business
- 6 = Professional with advanced degree

affairs: A measure of the time spent in extramarital affairs. This is a continuous variable.


## quandl for financial data via API

We will pull financial data from the internet using the quandl API. It's free (with rate limits), but you first need to sign up to get your own API key. Here's the link to sign up via the NASDAQ page: https://data.nasdaq.com/sign-up After signing up, your personal API key will be displayed on the screen; it looks like a random string such as vnslk_F9502na. Don't share it with anyone. If you lose your API key, you can find it later here: https://data.nasdaq.com/account/profile

Available datasets can be reached here: https://data.nasdaq.com/search
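
One common way to avoid sharing the key by accident is to keep it out of the notebook and read it from an environment variable. The sketch below assumes a variable named `QUANDL_API_KEY`; both that name and the helper `get_api_key` are made up for this example and are not part of the quandl package.

```python
import os

def get_api_key(env_var="QUANDL_API_KEY"):
    """Return the key stored in the environment variable, or None if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        print(f"Set the {env_var} environment variable before calling the API.")
    return key

api_key = get_api_key()
```

The returned value could then be assigned to `quandl.ApiConfig.api_key` instead of typing the key into a cell.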

In [None]:
# Install the quandl package first
!pip install quandl

In [None]:
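import quandl
from keys import quandl_api_key  # keys.py is a local file that defines quandl_api_key

# quandl_api_key = "lolo"   # alternatively, remove the import above, uncomment this line and replace lolo with your own API key

quandl.ApiConfig.api_key = quandl_api_key

# Fetch gold price data (named gold_prices so it does not overwrite pydataset's data function)
gold_prices = quandl.get("LBMA/GOLD")

# Display the first and last few rows
display(gold_prices.head())
display(gold_prices.tail())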


## Using Web APIs

In [None]:
# This example doesn't require API KEY. 

import requests

# Define the API endpoint
country = "turkey"
url = f"https://restcountries.com/v3.1/name/{country}"

# Make the HTTP GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON data
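    # The API returns a list, so we take the first item.
    # (Named country_data so it does not overwrite pydataset's data function.)
    country_data = response.json()[0]
    
    # Extract and display relevant country data
    name = country_data["name"]["common"]
    capital = country_data["capital"][0]
    population = country_data["population"]
    area = country_data["area"]
    # Joining a dictionary joins its keys, here the currency codes
    currencies = ", ".join(country_data["currencies"])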
    
    print(f"Information about {name}:")
    print(f"Capital: {capital}")
    print(f"Population: {population}")
    print(f"Area: {area} sq.km")
    print(f"Currencies: {currencies}")
else:
    print(f"Failed to retrieve data. HTTP Status Code: {response.status_code}")
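
JSON responses do not always contain every field, so indexing with `[...]` can raise a `KeyError`. The sketch below shows a more tolerant way to pick out the same kind of fields. It runs offline on a made-up record (`Exampleland` is not real data), and `summarize_country` is a helper name introduced only for this example.

```python
def summarize_country(record):
    """Pick a few fields out of one country record, tolerating missing keys."""
    return {
        "name": record.get("name", {}).get("common", "unknown"),
        "capital": (record.get("capital") or ["unknown"])[0],
        "population": record.get("population"),
        "currencies": ", ".join(record.get("currencies", {})),
    }

# Made-up record shaped like the fields used above; "capital" is left out on purpose.
sample = {
    "name": {"common": "Exampleland"},
    "population": 1000,
    "currencies": {"EXL": {"name": "Example lira"}},
}

summary = summarize_country(sample)
print(summary)
```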



## TensorFlow

In [None]:
# The cells below import both tensorflow and tensorflow_datasets
!pip install tensorflow tensorflow_datasets

In [None]:
import tensorflow_datasets as tfds

# List all available datasets
datasets_list = tfds.list_builders()

print(datasets_list)

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Load the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

# Function to display a grid of images
def display_images(images, labels, num_rows=2, num_cols=5):
    """Displays a grid of images."""
    plt.figure(figsize=(10, 4))
    for i in range(num_rows * num_cols):
        plt.subplot(num_rows, num_cols, i + 1)
        plt.imshow(images[i], cmap='gray')
        plt.title(f"Label: {labels[i]}")
        plt.axis('off')
    plt.tight_layout()
    plt.show()

# Display the first few images from the training set
display_images(train_images, train_labels)



# Loading data from files

In [None]:
import pandas as pd

houses = pd.read_csv("houses.csv")
houses

# If the data is separated by ";" instead of ",", use pd.read_csv("houses.csv", delimiter=';')
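
The delimiter option can be tried without any file on disk by reading from an in-memory string. The two rows below are made-up illustrative values, not taken from houses.csv; `sep` is the parameter for which `delimiter` is an alias.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a file
csv_text = "id;price\n1;100\n2;250\n"

# sep=";" tells pandas to split columns on semicolons
df_semicolon = pd.read_csv(io.StringIO(csv_text), sep=";")

print(df_semicolon)
```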

"Total" is total market value of the house. This value usually combines the worth of both the land and the building on it.

# Databases: PostgreSQL Example

In [None]:
!pip install pandas sqlalchemy psycopg2

# psycopg2 is a popular PostgreSQL database adapter (library) for Python. It allows you to connect to a PostgreSQL database 
# from your Python application and perform various database operations such as querying, inserting, updating, and deleting data.


Before the following steps, you need to start the PostgreSQL server in the background by using the Terminal (macOS or Linux) or the command line in Windows.
Then you need to import the dvdrental.tar database into your PostgreSQL server.

In [None]:
import pandas as pd
from sqlalchemy import create_engine

username = "postgres"  # this is usually the default username
password = "foo"   # replace foo with your password

# Create an engine instance
engine = create_engine(f'postgresql://{username}:{password}@localhost:5432/dvdrental')

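try:
    # The with-block closes the connection automatically, and nothing is closed
    # if the connection could not be opened in the first place
    with engine.connect() as connection:
        print("Successfully connected to the dvdrental database!")
except Exception as e:
    print(f"Error: {e}")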


In [None]:
query_tables = "SELECT table_name FROM information_schema.tables WHERE table_schema='public';"
tables = pd.read_sql(query_tables, engine)
print(tables)

In [None]:
# Connect to the database and retrieve data
query = "SELECT * FROM film LIMIT 5;"  # This retrieves the first 5 rows from the 'film' table
df = pd.read_sql(query, engine)

# Display the DataFrame
df


In [None]:
# Connect to the database and retrieve data
query = "SELECT * FROM language LIMIT 5;"  # This retrieves the first 5 rows from the 'language' table
df = pd.read_sql(query, engine)

# Display the DataFrame
df
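
If no PostgreSQL server is available, the same `pd.read_sql` pattern can be tried with Python's built-in `sqlite3` module and an in-memory database. This is a substitute for practice, not the dvdrental database: the table contents below are placeholders made up for this sketch.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database that disappears when the connection closes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [(1, "Placeholder A"), (2, "Placeholder B"), (3, "Placeholder C")],
)
conn.commit()

# Same idea as above: run a query and get a DataFrame back
df_film = pd.read_sql("SELECT * FROM film LIMIT 2;", conn)
conn.close()

print(df_film)
```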