# Best Online Datasets for AI Domains

This notebook explores valuable online datasets for different AI domains, including:
- Data Science
- Machine Learning
- Data Visualization
- Deep Learning
- Python Programming

We'll walk through dataset sources, examples, and how to access them for your projects.

## Setup Environment

First, let's install and import the necessary libraries for handling and exploring datasets.

In [None]:
# Install necessary packages if needed
# Uncomment and run if required
# !pip install pandas numpy matplotlib seaborn requests plotly kaggle scikit-learn tensorflow

# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import os
import io
from zipfile import ZipFile
from urllib.request import urlopen
import plotly.express as px

# Configure displays
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')  # Matplotlib 3.6+ name; older releases use 'seaborn-whitegrid'
sns.set_theme(style="whitegrid")

print("Setup completed!")
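
Matplotlib style names have changed between releases, so a hard-coded name can fail on some installations. The following is a minimal sketch of a more defensive setup: it picks the first style that the installed Matplotlib actually provides. The helper name `pick_style` and the candidate list are illustrative assumptions, not part of the original setup.

```python
import matplotlib.pyplot as plt

def pick_style(candidates, fallback="default"):
    """Return the first style name that this Matplotlib installation provides."""
    for name in candidates:
        if name in plt.style.available:
            return name
    return fallback

# Candidate names are assumptions: the newer 'seaborn-v0_8-*' name first, then the older one.
chosen_style = pick_style(["seaborn-v0_8-whitegrid", "seaborn-whitegrid"])
plt.style.use(chosen_style)
print(f"Using Matplotlib style: {chosen_style}")
```

If neither candidate exists, the helper returns `"default"`, which `plt.style.use` always accepts.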

## Data Science (DS) Datasets

Data Science datasets are fundamental for statistical analysis, data cleaning, and exploratory data analysis. Here are some excellent sources and examples:

In [None]:
# Create a function to display dataset sources
def display_data_sources(domain, sources):
    """Display dataset sources for a specific AI domain"""
    print(f"=== {domain} Dataset Sources ===")
    for source in sources:
        print(f"- {source['name']}: {source['url']}")
        print(f"  Description: {source['description']}")
    print("\n")

# Data Science dataset sources
ds_sources = [
    {
        "name": "Kaggle",
        "url": "https://www.kaggle.com/datasets",
        "description": "Thousands of datasets for various data science problems."
    },
    {
        "name": "UCI Machine Learning Repository",
        "url": "https://archive.ics.uci.edu/",
        "description": "Collection of databases, domain theories, and data generators used for empirical analysis of machine learning algorithms."
    },
    {
        "name": "data.gov",
        "url": "https://www.data.gov/",
        "description": "US government's open data portal with over 200,000 datasets."
    },
    {
        "name": "Google Dataset Search",
        "url": "https://datasetsearch.research.google.com/",
        "description": "Search engine for datasets."
    },
    {
        "name": "AWS Open Data Registry",
        "url": "https://registry.opendata.aws/",
        "description": "Public datasets available through AWS resources."
    }
]

display_data_sources("Data Science", ds_sources)
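
Printing works well for a quick look, but a table is easier to filter or export. As a hedged sketch, the same list-of-dicts structure can be turned into a pandas DataFrame. The helper name `sources_to_frame` is hypothetical, and only two entries from the list above are repeated here to keep the example self-contained.

```python
import pandas as pd

# Two entries copied from the list above; the same pattern works for any of the source lists.
example_sources = [
    {
        "name": "Kaggle",
        "url": "https://www.kaggle.com/datasets",
        "description": "Thousands of datasets for various data science problems.",
    },
    {
        "name": "data.gov",
        "url": "https://www.data.gov/",
        "description": "US government's open data portal with over 200,000 datasets.",
    },
]

def sources_to_frame(domain, sources):
    """Build a DataFrame with one row per source plus a leading 'domain' column."""
    frame = pd.DataFrame(sources, columns=["name", "url", "description"])
    frame.insert(0, "domain", domain)
    return frame

sources_df = sources_to_frame("Data Science", example_sources)
print(sources_df[["domain", "name", "url"]])
```

Frames built this way for several domains can be combined with `pd.concat` to get a single searchable catalog.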

### Example: Exploring a Data Science Dataset

Let's explore the Iris dataset, a classic dataset for data science practice:

In [None]:
# Load the Iris dataset from scikit-learn
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = [iris.target_names[i] for i in iris.target]

# Display the first few rows
print("Iris Dataset Preview:")
iris_df.head()

In [None]:
# Basic exploratory data analysis
print("Dataset Shape:", iris_df.shape)
print("\nDataset Info:")
iris_df.info()

print("\nDescriptive Statistics:")
iris_df.describe()

In [None]:
# Visualization of the Iris dataset
plt.figure(figsize=(12, 6))

# Create a scatter plot
plt.subplot(1, 2, 1)
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', 
                hue='species', data=iris_df)
plt.title('Sepal Dimensions by Species')

plt.subplot(1, 2, 2)
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', 
                hue='species', data=iris_df)
plt.title('Petal Dimensions by Species')

plt.tight_layout()
plt.show()
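
The scatter plots suggest that petal measurements separate the species more cleanly than sepal measurements. One way to check this numerically is a per-species mean table. This is a small sketch that assumes scikit-learn 0.23 or later (for the `as_frame=True` option); the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features plus an integer 'target' column
iris_bunch = load_iris(as_frame=True)
iris_frame = iris_bunch.frame

# Map integer targets (0, 1, 2) to species names
iris_frame["species"] = iris_frame["target"].map(dict(enumerate(iris_bunch.target_names)))

# Mean of each measurement per species
species_means = iris_frame.drop(columns="target").groupby("species").mean()
print(species_means.round(2))
```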

## Machine Learning (ML) Datasets

Machine Learning datasets are designed for training and testing ML models. These datasets are suitable for classification, regression, and clustering problems.

In [None]:
# Machine Learning dataset sources
ml_sources = [
    {
        "name": "Scikit-learn built-in datasets",
        "url": "https://scikit-learn.org/stable/datasets/index.html",
        "description": "Clean datasets for classification, regression, clustering, and manifold learning."
    },
    {
        "name": "OpenML",
        "url": "https://www.openml.org/",
        "description": "Open platform for sharing ML datasets, tasks, and experiments."
    },
    {
        "name": "TensorFlow Datasets",
        "url": "https://www.tensorflow.org/datasets",
        "description": "Collection of ready-to-use datasets for ML research."
    },
    {
        "name": "Nasdaq Data Link (formerly Quandl)",
        "url": "https://data.nasdaq.com/",
        "description": "Financial, economic, and alternative datasets for ML in finance."
    },
    {
        "name": "MLData.org",
        "url": "http://mldata.org/",
        "description": "Repository for machine learning data (now offline; scikit-learn replaced fetch_mldata with fetch_openml)."
    }
]

display_data_sources("Machine Learning", ml_sources)

### Example: Exploring a Machine Learning Dataset

Let's explore a housing dataset, a common choice for regression problems. The Boston Housing dataset was removed in scikit-learn 1.2, so the cell below tries it first and falls back to the California Housing dataset:

In [None]:
# Load a housing dataset (Boston if available, otherwise California)
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)  # load_boston warns on scikit-learn 1.0-1.1

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 due to ethical concerns.
# The import sits inside the try block because it raises ImportError on newer versions.
try:
    from sklearn.datasets import load_boston
    boston = load_boston()
    boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
    boston_df['PRICE'] = boston.target

    # Display dataset info
    print("Boston Housing Dataset - First 5 rows:")
    print(boston_df.head())
    print("\nFeatures description:")
    print(boston.DESCR[:1000] + "...")  # Display partial description

except ImportError:
    print("The Boston Housing dataset is no longer available in scikit-learn.")
    print("Let's use the California Housing dataset instead.")

    from sklearn.datasets import fetch_california_housing
    california = fetch_california_housing()  # Downloads the data on first use
    california_df = pd.DataFrame(california.data, columns=california.feature_names)
    california_df['PRICE'] = california.target

    # Display dataset info
    print("California Housing Dataset - First 5 rows:")
    print(california_df.head())
    print("\nFeatures description:")
    print(california.DESCR[:1000] + "...")  # Display partial description

In [None]:
# Basic analysis of housing dataset
try:
    # boston_df exists only if the Boston dataset loaded in the previous cell
    housing_df = boston_df
except NameError:
    # Otherwise use the California dataset
    housing_df = california_df
target_feature = 'PRICE'

# Statistical summary
print("Statistical Summary:")
housing_df.describe()

In [None]:
# Correlation analysis
plt.figure(figsize=(12, 10))
correlation_matrix = housing_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Scatter plot of most correlated feature with target
most_correlated = correlation_matrix[target_feature].drop(target_feature).abs().idxmax()
plt.figure(figsize=(10, 6))
sns.regplot(x=most_correlated, y=target_feature, data=housing_df)
plt.title(f'Relationship between {most_correlated} and {target_feature}')
plt.xlabel(most_correlated)
plt.ylabel(target_feature)
plt.show()
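
The cell above picks only the single most correlated feature. A full ranking is often more informative. The sketch below shows the same idea on synthetic data so that it runs without any download; the column names `strong`, `weak`, and `noise` and the coefficients are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Synthetic target: driven mostly by 'strong', slightly (and negatively) by 'weak'
demo_df["PRICE"] = (
    3.0 * demo_df["strong"] - 0.5 * demo_df["weak"] + rng.normal(scale=0.5, size=n)
)

# Rank features by the absolute value of their Pearson correlation with the target
ranking = (
    demo_df.corr()["PRICE"]
    .drop("PRICE")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Taking the absolute value matters because a strongly negative correlation is as useful for prediction as a strongly positive one. Note that Pearson correlation only captures linear relationships.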

## Data Visualization (DV) Datasets

These datasets are particularly suited for creating compelling visualizations with temporal, geographical, and categorical features.

In [None]:
# Data Visualization dataset sources
dv_sources = [
    {
        "name": "Gapminder",
        "url": "https://www.gapminder.org/data/",
        "description": "Data on global development trends, perfect for temporal and geographical visualizations."
    },
    {
        "name": "Our World in Data",
        "url": "https://ourworldindata.org/",
        "description": "Research and data on global problems and their solutions."
    },
    {
        "name": "FiveThirtyEight",
        "url": "https://github.com/fivethirtyeight/data",
        "description": "Data and code behind FiveThirtyEight articles and graphics."
    },
    {
        "name": "The Pudding",
        "url": "https://github.com/the-pudding/data",
        "description": "Datasets from visual essays on The Pudding."
    },
    {
        "name": "Tableau Public Sample Data",
        "url": "https://public.tableau.com/en-us/s/resources",
        "description": "Curated datasets for creating visualizations."
    }
]

display_data_sources("Data Visualization", dv_sources)

### Example: Exploring a Data Visualization Dataset

Let's explore the Gapminder dataset, which is excellent for creating visualizations:

In [None]:
# Let's use the gapminder dataset from plotly express
try:
    gapminder = px.data.gapminder()
    print("Gapminder Dataset - First 5 rows:")
    display(gapminder.head())
except Exception:
    # Alternative approach if plotly's built-in dataset is not available
    !pip install -q plotly
    import plotly.express as px
    gapminder = px.data.gapminder()
    print("Gapminder Dataset - First 5 rows:")
    display(gapminder.head())

In [None]:
# Basic information about the Gapminder dataset
print("Dataset Shape:", gapminder.shape)
print("\nCountries in the dataset:", gapminder['country'].nunique())
print("Years in the dataset:", gapminder['year'].unique())
print("Continents in the dataset:", gapminder['continent'].unique())

# Summary statistics
gapminder.describe()

In [None]:
# Create visualizations with the Gapminder dataset

# 1. GDP per capita vs life expectancy for 2007, colored by continent
fig1 = px.scatter(
    gapminder[gapminder['year'] == 2007], 
    x="gdpPercap", y="lifeExp", 
    size="pop", color="continent",
    hover_name="country", log_x=True,
    size_max=60,
    title="GDP per capita vs Life Expectancy (2007)"
)
fig1.show()

# 2. Population growth over time for each continent
fig2 = px.line(
    gapminder.groupby(['year', 'continent'])['pop'].sum().reset_index(),
    x="year", y="pop", color="continent",
    title="Population Growth by Continent (1952-2007)"
)
fig2.show()

# 3. Life expectancy over time by continent
fig3 = px.box(
    gapminder, 
    x="year", y="lifeExp", color="continent", 
    notched=True,
    title="Life Expectancy Distribution by Continent (1952-2007)"
)
fig3.show()
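
The line chart above depends on a `groupby` followed by `reset_index`, which yields long-form data with one row per year and continent. As a rough illustration, the sketch below applies the same steps to a tiny frame and then pivots it to wide form. The rows are made up and only borrow Gapminder's column names.

```python
import pandas as pd

# Tiny made-up sample with Gapminder-style column names (values are illustrative only)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "pop": [10, 20, 5, 15, 30, 6],
})

# Long form: one row per (year, continent), the shape that px.line consumes
long_form = toy.groupby(["year", "continent"])["pop"].sum().reset_index()

# Wide form: one column per continent, convenient for tables or DataFrame.plot
wide_form = long_form.pivot(index="year", columns="continent", values="pop")

print(long_form)
print(wide_form)
```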

## Deep Learning (DL) Datasets

These datasets are specialized for training deep learning models, particularly in computer vision, natural language processing, and speech recognition.

In [None]:
# Deep Learning dataset sources
dl_sources = [
    {
        "name": "TensorFlow Datasets",
        "url": "https://www.tensorflow.org/datasets/catalog/overview",
        "description": "Collection of datasets ready to use with TensorFlow."
    },
    {
        "name": "Hugging Face Datasets",
        "url": "https://huggingface.co/datasets",
        "description": "Datasets for NLP tasks like text classification, question answering, etc."
    },
    {
        "name": "ImageNet",
        "url": "https://www.image-net.org/",
        "description": "Image database organized according to the WordNet hierarchy."
    },
    {
        "name": "COCO (Common Objects in Context)",
        "url": "https://cocodataset.org/",
        "description": "Dataset for object detection, segmentation, and captioning."
    },
    {
        "name": "AudioSet",
        "url": "https://research.google.com/audioset/",
        "description": "Large-scale dataset of manually annotated audio events."
    }
]

display_data_sources("Deep Learning", dl_sources)

### Example: Exploring a Deep Learning Dataset

Let's explore MNIST, a classic dataset for deep learning image classification:

In [None]:
# Load MNIST dataset
try:
    from tensorflow.keras.datasets import mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    
    print("MNIST Dataset loaded successfully")
    print(f"Training data shape: {X_train.shape}")
    print(f"Training labels shape: {y_train.shape}")
    print(f"Test data shape: {X_test.shape}")
    print(f"Test labels shape: {y_test.shape}")
    
    # If TensorFlow is not available, let's try to use sklearn's version
except ImportError:
    print("TensorFlow not available, loading MNIST from scikit-learn...")
    from sklearn.datasets import fetch_openml
    
    # Load data from OpenML
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, parser='auto')
    X = np.array(X).reshape(-1, 28, 28)
    y = np.array(y, dtype=int)
    
    # Use the conventional MNIST split (first 60,000 train, last 10,000 test)
    # so the shapes match the TensorFlow branch above
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]
    
    print("MNIST Dataset loaded successfully from scikit-learn")
    print(f"Training data shape: {X_train.shape}")
    print(f"Training labels shape: {y_train.shape}")
    print(f"Test data shape: {X_test.shape}")
    print(f"Test labels shape: {y_test.shape}")

In [None]:
# Display some examples from the MNIST dataset
plt.figure(figsize=(10, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_train[i], cmap='gray')
    plt.title(f"Label: {y_train[i]}")
    plt.axis('off')
plt.suptitle('MNIST Examples', fontsize=16)
plt.tight_layout()
plt.show()

# Print class distribution
unique_values, counts = np.unique(y_train, return_counts=True)
plt.figure(figsize=(10, 5))
plt.bar(unique_values, counts)
plt.xlabel('Digit')
plt.ylabel('Count')
plt.title('Distribution of Digits in MNIST Training Set')
plt.xticks(unique_values)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
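
Before images like these are fed to a model, they are usually rescaled and reshaped. The following is a minimal sketch of three common steps (scaling, flattening, and one-hot encoding). It uses random arrays as a stand-in for MNIST so that it runs without a download; `fake_images` and `fake_labels` are placeholders, not real data.

```python
import numpy as np

# Stand-in for MNIST: random uint8 images with the same per-image shape (28x28)
rng = np.random.default_rng(42)
fake_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
fake_labels = rng.integers(0, 10, size=100)

# Scale pixel values from [0, 255] to [0.0, 1.0]
scaled = fake_images.astype("float32") / 255.0

# Flatten each 28x28 image into a 784-element vector (useful for dense layers)
flat = scaled.reshape(len(scaled), -1)

# One-hot encode the integer labels: row i has a single 1 at column fake_labels[i]
one_hot = np.eye(10, dtype="float32")[fake_labels]

print("Flattened shape:", flat.shape)
print("One-hot shape:", one_hot.shape)
```

With the real dataset, `X_train` and `y_train` would take the place of the placeholder arrays. Convolutional models typically skip the flattening step and add a channel axis instead.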

## Python Programming (PY) Datasets

These datasets are suitable for Python programming practice and learning data manipulation techniques.

In [None]:
# Python Programming friendly dataset sources
py_sources = [
    {
        "name": "Python standard library data tools",
        "url": "https://docs.python.org/3/library/csv.html",
        "description": "Standard-library modules such as csv and sqlite3 for reading and querying local data files."
    },
    {
        "name": "GitHub Repositories",
        "url": "https://github.com/awesomedata/awesome-public-datasets",
        "description": "Curated list of public datasets organized by topic."
    },
    {
        "name": "Public APIs",
        "url": "https://github.com/public-apis/public-apis",
        "description": "A collective list of free APIs for use in software and web development."
    },
    {
        "name": "Pandas I/O User Guide",
        "url": "https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html",
        "description": "Guide to reading and writing CSV, JSON, Excel, SQL, and other formats with pandas."
    },
    {
        "name": "Seaborn Example Datasets",
        "url": "https://seaborn.pydata.org/generated/seaborn.load_dataset.html",
        "description": "Datasets built into the Seaborn visualization library."
    }
]

display_data_sources("Python Programming", py_sources)

### Example: Exploring a Python Programming Dataset

Let's explore a dataset from Seaborn that's excellent for Python programming practice:

In [None]:
# Load the Titanic dataset from Seaborn
try:
    titanic = sns.load_dataset('titanic')
    print("Titanic Dataset loaded successfully")
except Exception:
    # Alternative: read the CSV directly from GitHub if seaborn's loader fails
    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
    titanic = pd.read_csv(url)
    print("Titanic Dataset loaded successfully from GitHub")

# Display the first few rows
titanic.head()

In [None]:
# Basic information about the Titanic dataset
print("Dataset Shape:", titanic.shape)
print("\nBasic Information:")
titanic.info()

print("\nMissing Values:")
print(titanic.isna().sum())

print("\nSurvival Rate:")
survival_rate = titanic['survived'].mean() * 100
print(f"{survival_rate:.2f}% of passengers survived")

In [None]:
# Python data manipulation examples with the Titanic dataset

# Example 1: Groupby operations
print("Survival Rate by Class:")
class_survival = titanic.groupby('class', observed=True)['survived'].mean().sort_values(ascending=False)
print(class_survival)

print("\nSurvival Rate by Gender:")
gender_survival = titanic.groupby('sex')['survived'].mean()
print(gender_survival)

print("\nSurvival Rate by Age Group:")
# Create age groups
titanic['age_group'] = pd.cut(titanic['age'], bins=[0, 12, 18, 35, 60, 100], 
                             labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
age_survival = titanic.groupby('age_group', observed=True)['survived'].mean().sort_values(ascending=False)
print(age_survival)

# Example 2: Data visualization
plt.figure(figsize=(15, 10))

# Survival by class
plt.subplot(2, 2, 1)
sns.barplot(x='class', y='survived', data=titanic)
plt.title('Survival Rate by Class')

# Survival by gender
plt.subplot(2, 2, 2)
sns.barplot(x='sex', y='survived', data=titanic)
plt.title('Survival Rate by Gender')

# Age distribution
plt.subplot(2, 2, 3)
sns.histplot(data=titanic, x='age', hue='survived', bins=30, multiple='stack')
plt.title('Age Distribution by Survival')

# Fare vs Age with survival
plt.subplot(2, 2, 4)
sns.scatterplot(x='age', y='fare', hue='survived', size='survived', 
                sizes={0: 20, 1: 60}, alpha=0.7, data=titanic)
plt.title('Fare vs Age with Survival')

plt.tight_layout()
plt.show()
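
The missing-value counts printed earlier show that `age` has gaps, which is why some passengers fall outside every age group. One common way to handle this, sketched below, is to fill each missing age with the median age of the passenger's class. The frame here is a small made-up example that borrows the Titanic column names `pclass`, `age`, and `survived`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small made-up frame using Titanic-style column names (values are illustrative only)
demo = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
    "survived": [1, 1, 0, 0, 1, 0],
})

# transform("median") returns a Series aligned with demo, holding each row's class median
class_median_age = demo.groupby("pclass")["age"].transform("median")

# Fill only the missing ages; existing values are left unchanged
demo["age_filled"] = demo["age"].fillna(class_median_age)
print(demo)
```

Group-wise imputation keeps more structure than a single global median, but it is still an assumption about the data and worth stating in any analysis that relies on it.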

## Accessing and Loading Datasets

Different data sources require different methods for accessing and loading datasets. Let's explore common approaches:

In [None]:
# Create a function to demonstrate different ways to load datasets
def demonstrate_data_loading():
    print("=== METHODS FOR LOADING DATASETS ===\n")
    
    # Method 1: Loading data from built-in libraries
    print("1. From Built-in Libraries:")
    print("```python")
    print("# Scikit-learn")
    print("from sklearn.datasets import load_iris")
    print("iris = load_iris()")
    print("iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)")
    print("\n# Seaborn")
    print("import seaborn as sns")
    print("titanic = sns.load_dataset('titanic')")
    print("```\n")
    
    # Method 2: Loading from URLs
    print("2. From URLs:")
    print("```python")
    print("# Pandas directly from URL")
    print("url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'")
    print("df = pd.read_csv(url)")
    print("```\n")
    
    # Method 3: From APIs
    print("3. From APIs:")
    print("```python")
    print("import requests")
    print("import json")
    print("\n# Example: Open Notify API (ISS location)")
    print("response = requests.get('http://api.open-notify.org/iss-now.json')")
    print("data = response.json()")
    print("```\n")
    
    # Method 4: Kaggle API
    print("4. From Kaggle API:")
    print("```python")
    print("# First, configure your Kaggle API credentials")
    print("# !kaggle datasets download -d USERNAME/DATASET-SLUG")
    print("\n# Example:")
    print("import kaggle")
    print("kaggle.api.authenticate()")
    print("kaggle.api.dataset_download_files('ronitf/heart-disease-uci', path='.', unzip=True)")
    print("df = pd.read_csv('heart.csv')")
    print("```\n")
    
    # Method 5: TensorFlow datasets
    print("5. From TensorFlow Datasets:")
    print("```python")
    print("import tensorflow_datasets as tfds")
    print("mnist, info = tfds.load('mnist', with_info=True, as_supervised=True)")
    print("train_dataset, test_dataset = mnist['train'], mnist['test']")
    print("```\n")
    
    # Method 6: Hugging Face datasets
    print("6. From Hugging Face Datasets:")
    print("```python")
    print("from datasets import load_dataset")
    print("dataset = load_dataset('glue', 'sst2')")
    print("train_dataset = dataset['train']")
    print("```\n")
    
    # Method 7: Database connections
    print("7. From Databases:")
    print("```python")
    print("import sqlite3")
    print("conn = sqlite3.connect('database.db')")
    print("df = pd.read_sql_query('SELECT * FROM table_name', conn)")
    print("conn.close()")
    print("```\n")

demonstrate_data_loading()
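
Of the methods listed above, the database route can be tried without any external service. The following is a minimal sketch, assuming an in-memory SQLite database; the table name `flowers` and its three rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database, so the example leaves no file behind
conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE TABLE flowers (species TEXT, petal_length REAL)")
    conn.executemany(
        "INSERT INTO flowers VALUES (?, ?)",
        [("setosa", 1.4), ("versicolor", 4.7), ("virginica", 6.0)],
    )
    conn.commit()

    # Parameterized query: the '?' placeholder is filled from params
    long_petals = pd.read_sql_query(
        "SELECT species, petal_length FROM flowers "
        "WHERE petal_length > ? ORDER BY petal_length",
        conn,
        params=(4.0,),
    )
finally:
    # Close the connection even if a statement above fails
    conn.close()

print(long_petals)
```

Passing values through `params` instead of formatting them into the SQL string avoids quoting mistakes and SQL injection.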

### Practical Example: Loading Data from an API

In [None]:
# Let's demonstrate loading data from a public API
try:
    # Using Open-Meteo API to get weather data for New York
    url = "https://api.open-meteo.com/v1/forecast?latitude=40.71&longitude=-74.01&daily=temperature_2m_max,temperature_2m_min,precipitation_sum&timezone=America%2FNew_York"
    
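    response = requests.get(url, timeout=10)  # Avoid hanging if the server does not respond
    response.raise_for_status()  # Raise an error for HTTP 4xx/5xx responses
    weather_data = response.json()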
    
    # Convert to pandas dataframe
    if 'daily' in weather_data:
        daily_data = weather_data['daily']
        weather_df = pd.DataFrame({
            'date': daily_data['time'],
            'max_temp': daily_data['temperature_2m_max'],
            'min_temp': daily_data['temperature_2m_min'],
            'precipitation': daily_data['precipitation_sum']
        })
        
        print("New York Weather Forecast:")
        display(weather_df.head())
        
        # Simple visualization
        plt.figure(figsize=(12, 6))
        plt.plot(weather_df['date'], weather_df['max_temp'], 'r-', label='Max Temperature')
        plt.plot(weather_df['date'], weather_df['min_temp'], 'b-', label='Min Temperature')
        plt.fill_between(weather_df['date'], weather_df['min_temp'], weather_df['max_temp'], alpha=0.2)
        plt.bar(weather_df['date'], weather_df['precipitation'], alpha=0.3, color='blue', label='Precipitation')
        plt.title('New York Weather Forecast')
        plt.xlabel('Date')
        plt.ylabel('Temperature (°C) / Precipitation (mm)')
        plt.legend()
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()
    else:
        print("Could not retrieve weather data.")
except Exception as e:
    print(f"Error accessing API: {e}")
    print("API connections may be restricted in this environment.")
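
Separating the parsing step from the network call makes it possible to test the conversion offline. The sketch below does this with a hand-written payload shaped like the `daily` block used above. The helper name `daily_to_frame` is hypothetical and the sample values are made up, not real forecast data.

```python
import pandas as pd

# Hand-written sample shaped like the 'daily' block used above (values are made up)
sample_payload = {
    "daily": {
        "time": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "temperature_2m_max": [5.0, 7.5, 3.2],
        "temperature_2m_min": [-1.0, 0.5, -2.3],
        "precipitation_sum": [0.0, 4.1, 0.0],
    }
}

def daily_to_frame(payload):
    """Convert a payload with a 'daily' block into a date-indexed DataFrame."""
    daily = payload.get("daily")
    if daily is None:
        raise ValueError("payload has no 'daily' block")
    frame = pd.DataFrame(daily).rename(columns={
        "time": "date",
        "temperature_2m_max": "max_temp",
        "temperature_2m_min": "min_temp",
        "precipitation_sum": "precipitation",
    })
    # Real datetimes let plotting libraries space and format the x-axis correctly
    frame["date"] = pd.to_datetime(frame["date"])
    return frame.set_index("date")

weather_demo = daily_to_frame(sample_payload)
print(weather_demo)
```

With a live response, the result of `response.json()` could be passed to the same function in place of the sample payload.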

## Summary

This notebook has explored a variety of high-quality datasets for different AI domains:

1. **Data Science Datasets**: We explored sources like Kaggle, UCI ML Repository, and data.gov for statistical analysis and EDA.

2. **Machine Learning Datasets**: We examined datasets from scikit-learn, OpenML, and other sources for training ML models.

3. **Data Visualization Datasets**: We used datasets from Gapminder and other sources that are excellent for creating compelling visualizations.

4. **Deep Learning Datasets**: We reviewed specialized datasets for computer vision, NLP, and other deep learning applications.

5. **Python Programming Datasets**: We explored datasets that are useful for practicing Python data manipulation skills.

Additionally, we demonstrated various methods for accessing and loading datasets from different sources, from built-in libraries to APIs.

These datasets and resources should provide a solid foundation for any AI, machine learning, or data science project you undertake.