# Setting up a Data Science Environment

This notebook provides a comprehensive guide to setting up and configuring a robust data science environment. We'll cover various tools, package management systems, IDEs, and best practices to create an efficient workflow for data science projects.

## 1. Understanding Data Science Environment Requirements

A good data science environment should have:

- **Python installation**: The core programming language
- **Package management**: For installing and managing libraries
- **Development environment**: IDEs or notebooks for coding
- **Essential libraries**: NumPy, pandas, scikit-learn, matplotlib, etc.
- **Environment isolation**: To manage dependencies across projects
- **Version control**: To track changes and collaborate

## 2. Python Installation Options

### Option 1: Anaconda Distribution (Recommended for beginners)

Anaconda is a comprehensive distribution that includes Python and many data science packages.

**Installation steps:**
1. Download Anaconda from [https://www.anaconda.com/products/distribution](https://www.anaconda.com/products/distribution)
2. Run the installer and follow the instructions
3. Verify the installation by opening Anaconda Navigator or running `conda --version` in a terminal

### Option 2: Miniconda (Lightweight alternative)

Miniconda provides a minimal installation with just Python and conda.

**Installation steps:**
1. Download Miniconda from [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html)
2. Run the installer and follow the instructions
3. Verify with `conda --version`

### Option 3: Standard Python Installation

**Installation steps:**
1. Download from [https://www.python.org/downloads/](https://www.python.org/downloads/)
2. Install and verify with `python --version` (on some Mac/Linux systems the command is `python3 --version`)
3. Use pip for package management: `pip --version`
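
Whichever option you choose, it helps to confirm which tools are actually reachable from your session. The following is a minimal sketch using only the standard library; the helper name `find_tools` is hypothetical and the list of commands is an assumption based on the tools mentioned in this notebook.

```python
import shutil
import sys


def find_tools(names=("python", "pip", "conda", "git")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


print("Running interpreter:", sys.executable)
print("Python version:", sys.version.split()[0])
for name, path in find_tools().items():
    print(f"{name:>6}: {path or 'not found on PATH'}")
```

A tool reported as "not found on PATH" may still be installed; it may simply not be visible to the process running this notebook.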

## 3. Environment Management

### Using Conda Environments

Conda environments help isolate project dependencies.

In [None]:
# Create a new conda environment
# !conda create -n datasci python=3.9 -y

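# Activate the environment (same command on Windows, Mac, and Linux with conda 4.6 or later)
# Run this in a terminal: each `!` command in a notebook runs in its own subshell,
# so an activation done with `!` would not persist.
# conda activate datasci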

# List available environments
# !conda env list
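
To check from Python which conda environment is active and which ones exist, a sketch along these lines may help. It assumes that `conda env list --json` prints a JSON object with an `envs` list of environment paths; the helper name `env_names` is hypothetical.

```python
import json
import os
import shutil
import subprocess


def env_names(conda_json):
    """Return environment names from the text printed by `conda env list --json`."""
    paths = json.loads(conda_json).get("envs", [])
    # The name is the last path component; the base environment shows up
    # under its install folder name (for example "miniconda3").
    return [os.path.basename(os.path.normpath(p)) for p in paths]


# Name of the active conda environment, or None when conda is not active.
print("Active conda environment:", os.environ.get("CONDA_DEFAULT_ENV"))

conda = shutil.which("conda")
if conda:  # only query conda when it is actually installed
    result = subprocess.run(
        [conda, "env", "list", "--json"], capture_output=True, text=True
    )
    if result.returncode == 0:
        print("Known environments:", env_names(result.stdout))
```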

### Using Virtual Environments (venv)

If you're using a standard Python installation, `venv` is the built-in solution.

In [None]:
# Create a virtual environment
# !python -m venv datasci_env

# Activate the environment in a terminal, not in a notebook cell:
# each `!` command runs in its own subshell, so the activation would not persist.
# Windows:   datasci_env\Scripts\activate
# Mac/Linux: source datasci_env/bin/activate
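
One way to confirm that a notebook kernel or script is really running inside a `venv` environment is to compare `sys.prefix` with `sys.base_prefix`. This is a small sketch; the function name `in_virtual_env` is hypothetical, and the check applies to `venv`-style environments only (conda environments are not detected this way).

```python
import sys


def in_virtual_env():
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Environment prefix:", sys.prefix)
print("Inside a virtual environment:", in_virtual_env())
```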

## 4. Installing Essential Data Science Libraries

With your environment activated, install the core libraries for data science work.

In [None]:
# Using conda (recommended if using Anaconda/Miniconda)
# !conda install numpy pandas matplotlib scikit-learn scipy jupyter seaborn statsmodels -y

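# Using pip (works with any Python installation)
# !pip install numpy pandas matplotlib scikit-learn scipy jupyter seaborn statsmodels
# Inside a notebook, the %pip magic installs into the environment of the running kernel:
# %pip install numpy pandas matplotlib scikit-learn scipy jupyter seaborn statsmodels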
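
Before or after installing, you can check which of the core libraries the current interpreter can find. The sketch below uses `importlib.util.find_spec`, which looks a module up without importing it. The mapping `CORE_PACKAGES` and the helper `missing_packages` are hypothetical names; note that the install name and the import name can differ (`scikit-learn` is imported as `sklearn`). The `jupyter` package is left out because it is mainly a command-line entry point.

```python
import importlib.util

# Install name -> name used in `import` statements (they sometimes differ).
CORE_PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "seaborn": "seaborn",
    "statsmodels": "statsmodels",
}


def missing_packages(packages=CORE_PACKAGES):
    """Return the install names of packages whose import module cannot be found."""
    return [
        pkg
        for pkg, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```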

### Additional Useful Libraries

In [None]:
# Visualization libraries
# !pip install plotly bokeh

# Machine learning extensions
# !pip install xgboost lightgbm catboost

# Deep learning
# !pip install tensorflow torch

# For working with geographic data
# !pip install geopandas folium

# For natural language processing
# !pip install nltk spacy gensim

## 5. Setting Up Development Environments

### Jupyter Notebook and JupyterLab

Jupyter notebooks are excellent for data exploration and visualization. JupyterLab is a web-based interactive development environment that extends Jupyter notebook functionality.

In [None]:
# Install JupyterLab if not already installed
# !pip install jupyterlab

# Run JupyterLab (execute in a terminal, not in a notebook cell)
# jupyter lab

### Integrated Development Environments (IDEs)

For larger projects and production code, consider using a full-featured IDE:

1. **VS Code**
   - Download from: [https://code.visualstudio.com/](https://code.visualstudio.com/)
   - Install Python extension
   - Install Jupyter extension

2. **PyCharm**
   - Download from: [https://www.jetbrains.com/pycharm/](https://www.jetbrains.com/pycharm/)
   - Community Edition is free; the Professional edition adds more data science features

3. **Spyder** (included in Anaconda)
   - Scientific Python Development Environment focused on data analysis

## 6. Version Control with Git

Git helps track changes to your code and collaborate with others.

**Installation steps:**
1. Download from [https://git-scm.com/downloads](https://git-scm.com/downloads)
2. Install and verify with `git --version`

In [None]:
# Basic git setup
# !git config --global user.name "Your Name"
# !git config --global user.email "your.email@example.com"

# Initialize a new repository
# !git init

# Create .gitignore for data science projects (Mac/Linux)
# printf is used because plain echo does not expand \n in every shell
# !printf ".ipynb_checkpoints\n*.csv\n*.xlsx\n__pycache__/\ndata/\nmodels/\n" > .gitignore
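
Shell commands treat `\n` differently across shells and operating systems, so writing the file from Python is a portable alternative. This is a sketch only: the helper `write_gitignore` is a hypothetical name, the patterns are the ones listed above, and the function overwrites any existing `.gitignore` in the target directory. The demonstration writes into a temporary directory.

```python
import tempfile
from pathlib import Path

IGNORE_PATTERNS = [
    ".ipynb_checkpoints",
    "*.csv",
    "*.xlsx",
    "__pycache__/",
    "data/",
    "models/",
]


def write_gitignore(directory=".", patterns=IGNORE_PATTERNS):
    """Write one pattern per line to <directory>/.gitignore and return the path."""
    path = Path(directory) / ".gitignore"
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path


# Demonstrate in a temporary directory so no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    created = write_gitignore(tmp)
    print(created.read_text(encoding="utf-8"))
```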

## 7. Setting Up a Project Structure

Organizing your data science projects properly is crucial for maintainability and reproducibility.

In [None]:
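# Example commands to create a data science project structure (Mac/Linux)
# Paths are spelled out because brace expansion such as {raw,processed} is a bash
# feature and may not work in the shell that `!` invokes.
# !mkdir -p project_name/data/raw project_name/data/processed project_name/data/external
# !mkdir -p project_name/notebooks
# !mkdir -p project_name/src/data project_name/src/features project_name/src/models project_name/src/visualization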
# !mkdir -p project_name/models
# !mkdir -p project_name/reports/figures
# !touch project_name/README.md
# !touch project_name/requirements.txt

The structure looks like this:

```
project_name/
│
├── data/               # Data files
│   ├── raw/            # Original immutable data
│   ├── processed/      # Cleaned and processed data
│   └── external/       # Data from third party sources
│
├── notebooks/          # Jupyter notebooks for exploration
│
├── src/                # Source code (Python modules)
│   ├── data/           # Data processing scripts
│   ├── features/       # Feature engineering scripts
│   ├── models/         # Model training and prediction scripts
│   └── visualization/  # Visualization scripts
│
├── models/             # Trained models and model predictions
│
├── reports/            # Reports and presentations
│   └── figures/        # Generated graphics and figures
│
├── README.md           # Project description
└── requirements.txt    # Dependencies
```
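
The same layout can also be created from Python with `pathlib`, which works on Windows as well. This is a minimal sketch: `create_project`, `SUBDIRS`, and `FILES` are hypothetical names, and the demonstration runs in a temporary directory.

```python
import tempfile
from pathlib import Path

SUBDIRS = [
    "data/raw", "data/processed", "data/external",
    "notebooks",
    "src/data", "src/features", "src/models", "src/visualization",
    "models",
    "reports/figures",
]
FILES = ["README.md", "requirements.txt"]


def create_project(root):
    """Create the directory layout under `root`; existing paths are left untouched."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Demonstrate in a temporary directory so nothing is written to the working directory.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "project_name")
    for path in sorted(project.rglob("*")):
        print(path.relative_to(project))
```

Because `mkdir` and `touch` are called with `exist_ok=True`, running the function again on an existing project does not raise an error or erase files.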

## 8. Creating a Requirements File

To make your project reproducible, track your dependencies in a requirements file.

In [None]:
# Generate requirements.txt from your current environment
# !pip freeze > requirements.txt

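# If using conda (the output uses conda's own format, which pip cannot read,
# so it is saved under a separate file name)
# !conda list --export > conda-requirements.txt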
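
To see what a pinned dependency list contains, the installed distributions can also be read from Python with the standard-library `importlib.metadata` module. This is a rough sketch, not a replacement for `pip freeze`: it does not handle editable or VCS installs, and the helper name `pinned_requirements` is hypothetical.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted `name==version` lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in pinned_requirements()[:10]:  # preview the first ten entries
    print(line)
```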

## 9. Setting Up Environment Variables

Environment variables help manage configuration and secrets without hardcoding them in your scripts.

In [None]:
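# Create a .env file template (never commit actual secrets to version control;
# keep the real .env file out of git by listing it in .gitignore)
# printf is used because plain echo does not expand \n in every shell
# !printf "API_KEY=your_api_key_here\nDATABASE_URL=your_db_connection_string\n" > .env.example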

# Install python-dotenv to work with .env files
# !pip install python-dotenv

Example of using environment variables in Python:

In [None]:
# Example code for using environment variables
import os
from dotenv import load_dotenv

# Load environment variables from .env file
# load_dotenv()

# Access environment variables
# api_key = os.environ.get('API_KEY')
# database_url = os.environ.get('DATABASE_URL')
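
`os.environ.get` returns `None` when a variable is missing, which can cause confusing errors later in a script. One possible pattern, sketched below, is to fail early for required settings and supply defaults for optional ones. The helper `require_env` and the variable `LOG_LEVEL` are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Optional setting: fall back to a default when the variable is missing.
log_level = os.environ.get("LOG_LEVEL", "INFO")  # LOG_LEVEL is a hypothetical name
print("Log level:", log_level)

# Required setting: report a clear error instead of silently using None.
try:
    api_key = require_env("API_KEY")
    print("API_KEY is set")
except RuntimeError as err:
    print(err)
```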

## 10. Setting Up Jupyter Extensions

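Jupyter extensions add features such as a table of contents, collapsible headings, and code formatting. The `jupyter_contrib_nbextensions` package used below targets the classic Jupyter Notebook interface (version 6 and earlier); it is not compatible with Notebook 7 or JupyterLab, which use their own extension system.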

In [None]:
# Install Jupyter extensions
# !pip install jupyter_contrib_nbextensions
# !jupyter contrib nbextension install --user

# Enable popular extensions
# !jupyter nbextension enable toc2/main
# !jupyter nbextension enable collapsible_headings/main
# !jupyter nbextension enable code_prettify/code_prettify

## 11. Database Connections

Many data science projects require connecting to databases.

In [None]:
# Install database connection libraries
# !pip install sqlalchemy psycopg2-binary pymysql pymongo

Example of connecting to a SQL database:

In [None]:
# Example: SQLite connection
import sqlite3
import pandas as pd

# Create a connection to a SQLite database (or use an existing one)
# conn = sqlite3.connect('example.db')

# Example: create a table and insert data
# conn.execute('''
#     CREATE TABLE IF NOT EXISTS example_table (
#         id INTEGER PRIMARY KEY,
#         name TEXT,
#         value REAL
#     )
# ''')
# conn.commit()

# Use pandas to interact with SQL
# df = pd.read_sql("SELECT * FROM example_table", conn)
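
For a version that runs as-is, the sketch below uses an in-memory SQLite database and only the standard library. The table matches the one above; the two sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
try:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS example_table ("
        "id INTEGER PRIMARY KEY, name TEXT, value REAL)"
    )
    # Parameterized queries (the ? placeholders) avoid SQL injection.
    rows = [("alpha", 1.5), ("beta", 2.5)]
    conn.executemany("INSERT INTO example_table (name, value) VALUES (?, ?)", rows)
    conn.commit()

    result = conn.execute(
        "SELECT name, value FROM example_table ORDER BY id"
    ).fetchall()
    print(result)  # → [('alpha', 1.5), ('beta', 2.5)]
finally:
    conn.close()  # always release the connection, even if a query fails
```

With pandas installed, the same open connection can be passed to `pd.read_sql` as shown in the cell above.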

## 12. Testing Your Environment

Let's run a simple test to make sure your data science environment is set up correctly.

In [None]:
# Test importing key libraries
try:
    import numpy as np
    import pandas as pd
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sklearn
    from sklearn import datasets

    print("NumPy version:", np.__version__)
    print("Pandas version:", pd.__version__)
    print("Matplotlib version:", matplotlib.__version__)
    print("Seaborn version:", sns.__version__)
    # The version lives on the top-level sklearn package, not on sklearn.datasets
    print("Scikit-learn version:", sklearn.__version__)
    print("\nAll libraries imported successfully!")
except ImportError as e:
    print(f"Error importing libraries: {e}")

In [None]:
# Simple data science operation test
# Load a dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Create a simple plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', 
                hue='target', data=df, palette='viridis')
plt.title('Iris Dataset: Sepal Width vs Sepal Length')
plt.show()

print("Your data science environment is working correctly!")

## 13. Environment Sharing and Reproducibility

For collaboration, you'll want to share your environment configuration.

In [None]:
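# Create a conda environment file
# !conda env export > environment.yml

# For a file that is more portable across operating systems,
# export only the packages you explicitly requested:
# !conda env export --from-history > environment.yml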

# Another person can recreate your environment with:
# !conda env create -f environment.yml
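
Alongside the environment file, it can be useful to record which interpreter and platform produced your results. The sketch below collects these details with the standard library; the helper name `environment_summary` and the choice of fields are assumptions.

```python
import json
import platform
import sys


def environment_summary():
    """Collect interpreter and platform details worth recording with a project."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


print(json.dumps(environment_summary(), indent=2))
```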

## 14. Next Steps and Best Practices

1. **Keep environments minimal**: Only install what you need
2. **Document dependencies**: Always maintain requirements.txt or environment.yml
3. **Virtual environments**: Create separate environments for different projects
4. **Version control**: Track code changes with git
5. **Data versioning**: Consider tools like DVC (Data Version Control)
6. **Regular updates**: Update packages periodically but be careful of breaking changes
7. **Containerization**: Consider Docker for full environment reproducibility

## 15. Conclusion

You've now set up a complete data science environment with:

- Python installation
- Package and environment management
- Essential data science libraries
- Development tools and IDEs
- Version control
- Project organization structure
- Database connections
- Reproducibility tools

This foundation will allow you to efficiently work on data science projects while following best practices for code organization, reproducibility, and collaboration.