---

# **🐼Pandas Cheatsheet**

---

### 🧠 1. DataFrame & Series Creation  
📌 Create structured data containers  
🔧  
- `pd.Series(data)` 📍 – 1D labeled array  
- `pd.DataFrame(data)` 📋 – 2D labeled table  
- `pd.read_csv()` 📄 – Read CSV  
- `pd.read_excel()` 📊 – Read Excel  
- `pd.read_json()` 📦 – Read JSON  
- `df.to_csv()` 💾 – Save to CSV  
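
💡 A minimal sketch of the constructors above, using made-up data. The CSV round trip goes through an in-memory string (`io.StringIO`) instead of a real file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# A Series is a 1D labeled array; the index holds the labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame built from a dict of equal-length columns.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# Round trip through CSV text; index=False keeps the row index out of the output.
csv_text = df.to_csv(index=False)
df_back = pd.read_csv(io.StringIO(csv_text))
```

`pd.read_csv` accepts a path, URL, or any file-like object, so the same call works on a file on disk.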

---

### 🧠 2. Inspection & Info  
📌 Explore structure, types, and data summary  
🔧  
- `df.head(n)` 👀 – First n rows  
- `df.tail(n)` 🔚 – Last n rows  
- `df.info()` ℹ️ – Summary info  
- `df.describe()` 📐 – Statistical summary  
- `df.shape`, `df.columns`, `df.index` 📏 – Dimensions, column labels, row labels  
- `df.dtypes` 🧬 – Data types  
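
💡 A short sketch of the inspection calls on a small illustrative frame (city names and numbers are invented for the example).

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune", "Rome"],
    "temp_c": [4.5, 19.0, 27.5, 15.0],
    "visits": [3, 1, 7, 2],
})

n_rows, n_cols = df.shape     # (rows, columns) tuple
first_two = df.head(2)        # first 2 rows
summary = df.describe()       # count/mean/std/min/quartiles/max for numeric columns
dtypes = df.dtypes            # one dtype per column
```

On a frame with mixed types, `describe()` summarizes only the numeric columns unless `include='all'` is passed.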

---

### 🧠 3. Selection & Filtering  
📌 Get specific rows/columns/conditions  
🔧  
- `df['col']` 📌 – Access column  
- `df[['col1', 'col2']]` 🧲 – Multiple columns  
- `df.iloc[rows, cols]` 🔢 – Integer-location access  
- `df.loc[rows, cols]` 🔤 – Label-based access  
- `df[df['col'] > 5]` 🔍 – Conditional filter  
- `df.query('col > 5')` 🧠 – Filter with a string expression  
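
💡 The difference between `loc` and `iloc` is easiest to see on a frame with a non-numeric index. A minimal sketch, assuming illustrative row labels `r1` to `r4`:

```python
import pandas as pd

df = pd.DataFrame(
    {"col": [3, 8, 6, 1], "other": ["w", "x", "y", "z"]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = df.loc["r2", "col"]      # label-based: row "r2", column "col"
by_position = df.iloc[1, 0]         # position-based: second row, first column
label_slice = df.loc["r1":"r3"]     # label slices include both endpoints
pos_slice = df.iloc[0:2]            # position slices exclude the stop

mask_rows = df[df["col"] > 5]       # boolean mask
query_rows = df.query("col > 5")    # same filter written as a string expression
```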

---

### 🧠 4. Data Cleaning  
📌 Handle missing, duplicates, and type fixes  
🔧  
- `df.isnull()` ❓ – Check NaNs  
- `df.dropna()` 🗑 – Remove NaNs  
- `df.fillna(value)` 💧 – Fill NaNs  
- `df.duplicated()` 🧬 – Find duplicates  
- `df.drop_duplicates()` 🚮 – Remove duplicates  
- `df.astype(type)` 🔁 – Convert types  
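
💡 A sketch of a typical cleaning chain on invented data: count the gaps, drop the repeated row, fill what is missing, then fix the type.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": ["10", "20", "20", None],
})

missing_per_col = df.isnull().sum()        # missing-value count per column
deduped = df.drop_duplicates()             # drops the repeated (2, "20") row
filled = deduped.fillna({"score": "0"})    # fill only the 'score' column
clean = filled.astype({"score": int})      # strings -> integers
```

Each call returns a new object, so the original `df` stays unchanged.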

---

### 🧠 5. Sorting & Ranking  
📌 Organize and rank data  
🔧  
- `df.sort_values(by='col')` 🔃 – Sort by column  
- `df.sort_index()` 🔢 – Sort by index  
- `df.rank()` 🏅 – Ranking values  
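
💡 Ranking differs from sorting in how ties are handled. A minimal sketch with a hypothetical `points` column that contains one tie:

```python
import pandas as pd

df = pd.DataFrame({"player": ["a", "b", "c", "d"], "points": [50, 80, 80, 30]})

by_points = df.sort_values(by="points", ascending=False)

# rank() returns one rank per row; `method` controls how ties are resolved.
rank_avg = df["points"].rank(ascending=False)                 # ties share the mean rank
rank_min = df["points"].rank(ascending=False, method="min")   # ties share the lowest rank
```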

---

### 🧠 6. Aggregation & Grouping  
📌 Summarize and analyze grouped data  
🔧  
- `df.groupby('col')` 🧩 – Group by  
- `df.groupby('col').agg(['mean', 'sum'])` 📊 – Aggregate  
- `df.pivot_table()` 🔄 – Pivot summary  
- `df['col'].value_counts()` 🔢 – Count occurrences of each unique value  
- `pd.crosstab(df['a'], df['b'])` 🔀 – Cross tabulation (a top-level function, not a DataFrame method)  
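
💡 A sketch of the grouping tools side by side, assuming an invented table of match results:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "result": ["win", "loss", "win", "win", "loss"],
    "points": [3, 0, 3, 3, 0],
})

per_team = df.groupby("team")["points"].agg(["mean", "sum"])
pivot = df.pivot_table(index="team", columns="result", values="points", aggfunc="sum")
counts = df["team"].value_counts()
table = pd.crosstab(df["team"], df["result"])   # frequency of each team/result pair
```

Selecting the column (`["points"]`) before aggregating keeps string columns out of the `mean`.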

---

### 🧠 7. Merging & Joining  
📌 Combine datasets  
🔧  
- `pd.concat([df1, df2])` ➕ – Stack vertically (default) or side by side with `axis=1`  
- `pd.merge(df1, df2, on='key')` 🔗 – SQL-style join  
- `df.join(other_df)` 🤝 – Join by index  
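
💡 A minimal sketch of the three ways to combine frames, using two small made-up tables that share a `key` column:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": ["p", "q", "r"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20.0, 30.0, 40.0]})

stacked = pd.concat([left, left], ignore_index=True)    # rows appended, index renumbered
side_by_side = pd.concat([left, right], axis=1)         # columns side by side, aligned on index

inner = pd.merge(left, right, on="key")                 # default how="inner": keys 2 and 3
outer = pd.merge(left, right, on="key", how="outer")    # every key; gaps become NaN

# join() matches on the index and defaults to a left join.
joined = left.set_index("key").join(right.set_index("key"))
```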

---

### 🧠 8. Apply & Mapping  
📌 Element-wise transformations  
🔧  
- `df['col'].map(func)` 🧠 – Map values  
- `df.apply(func)` ⚙️ – Apply function to rows/cols  
- `df.map(func)` 🔁 – Apply to every cell (element-wise); named `df.applymap(func)` before pandas 2.1  
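
💡 A sketch contrasting the three levels: per value, per column or row, and per cell. The data is invented, and the `hasattr` check is only there so the example runs on pandas versions from before and after the `applymap` to `map` rename.

```python
import pandas as pd

df = pd.DataFrame({"size": ["s", "m", "l"], "w": [1, 2, 3], "h": [4, 5, 6]})
numbers = df[["w", "h"]]

size_code = df["size"].map({"s": 0, "m": 1, "l": 2})             # Series.map: value by value
col_range = numbers.apply(lambda col: col.max() - col.min())     # axis=0 (default): once per column
area = numbers.apply(lambda row: row["w"] * row["h"], axis=1)    # axis=1: once per row

# Cell by cell over a whole frame.
elementwise = numbers.map if hasattr(numbers, "map") else numbers.applymap
doubled = elementwise(lambda v: v * 2)
```

For simple arithmetic like these, plain vectorized expressions (`numbers * 2`, `df["w"] * df["h"]`) are usually faster than `apply`.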

---

### 🧠 9. Time Series  
📌 Time-based indexing & resampling  
🔧  
- `pd.to_datetime()` 🕒 – Convert to datetime  
- `df.resample('ME')` 📆 – Resample by month end (needs a datetime index; the alias is `'M'` before pandas 2.2)  
- `df['date'].dt.year` 📅 – Extract year  
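
💡 A minimal time-series sketch on invented sales data. It uses the month-start alias `"MS"` because that spelling is accepted by both older and newer pandas releases.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-15", "2024-01-31", "2024-02-10", "2024-03-05"],
    "sales": [100, 50, 70, 30],
})

df["date"] = pd.to_datetime(df["date"])   # strings -> datetime64
df["year"] = df["date"].dt.year           # .dt exposes datetime parts
df["month"] = df["date"].dt.month

# resample() needs a datetime-like index (or the on="date" argument).
monthly = df.set_index("date")["sales"].resample("MS").sum()
```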

---

### 🧠 10. Exporting & I/O  
📌 Save and load from files  
🔧  
- `df.to_csv('file.csv')` 💾  
- `df.to_excel('file.xlsx')` 📤  
- `df.to_json('file.json')` 📦  
- `pd.read_sql(query, conn)` 🛢 – SQL to DataFrame  
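
💡 A sketch of JSON and SQL round trips that needs no files or servers. The in-memory SQLite database and the `items` table name are assumptions made for illustration; `df.to_sql` is used only to put something in the database to read back.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "label": ["a", "b"]})

# JSON round trip: orient="records" writes a list of row objects.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# SQL round trip through an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df.to_sql("items", conn, index=False)
from_sql = pd.read_sql("SELECT id, label FROM items WHERE id > 1", conn)
conn.close()
```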

---

### ⚙️ Bonus Utilities  
📌 Handy tricks & performance  
🔧  
- `pd.set_option('display.max_columns', None)` 🖥 – Show all columns  
- `df.memory_usage()` 📊 – Check memory  
- `df.sample(n)` 🎯 – Random sample  
- `df.nunique()` 🧮 – Count unique per column  
- `df.corr(numeric_only=True)` 📈 – Correlation matrix of the numeric columns  
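
💡 A quick sketch of these utilities on a tiny invented frame with one non-numeric column:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "group": ["a", "a", "b", "b"],
})

unique_counts = df.nunique()                  # distinct values per column
bytes_per_col = df.memory_usage(deep=True)    # deep=True also measures string contents
picked = df.sample(n=2, random_state=0)       # random_state makes the draw repeatable
corr = df.corr(numeric_only=True)             # leaves out the string column 'group'
```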


---

# **🤖Pandas for Machine Learning – Advanced Cheatsheet**

---

### 📥 1. Data Loading & Preparation  
📌 Get and prepare data for ML models  
🔧  
- `pd.read_csv('data.csv')` 📄 – Load dataset  
- `df.sample(frac=0.1)` 🎯 – Random sample  
- `df.drop(columns=['col'])` 🗑 – Drop unwanted columns  
- `df.rename(columns={'old': 'new'})` 🏷 – Rename columns  
- `df.reset_index(drop=True)` 🔄 – Reset index  
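
💡 These preparation steps chain naturally. A sketch assuming a hypothetical dataset with an identifier column `row_id` that should not reach the model:

```python
import pandas as pd

df = pd.DataFrame({
    "row_id": [101, 102, 103, 104, 105],
    "old": [0.1, 0.2, 0.3, 0.4, 0.5],
    "target": [0, 1, 0, 1, 1],
})

prepared = (
    df.drop(columns=["row_id"])             # identifiers carry no signal
      .rename(columns={"old": "new"})
      .sample(frac=0.6, random_state=0)     # 60% of the rows, shuffled
      .reset_index(drop=True)               # discard the shuffled index
)
```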

---

### 🧹 2. Data Cleaning  
📌 Clean up messy real-world data  
🔧  
- `df.isnull().sum()` ❓ – Count NaNs  
- `df.dropna()` 🚮 – Drop rows with NaNs  
- `df.fillna(value)` 💧 – Fill missing  
- `df.duplicated()` 🧬 – Detect duplicates  
- `df.drop_duplicates()` 🧹 – Drop duplicates  
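
💡 For ML data the usual choice is between dropping and imputing. A sketch under two assumptions that are not from the cheatsheet itself: rows without a label are dropped, and remaining gaps are filled with the column median.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 60.0, np.nan, 70.0],
    "target": [1.0, 0.0, 1.0, np.nan],
})

nan_counts = df.isnull().sum()
labeled = df.dropna(subset=["target"])                        # keep only rows that have a label
imputed = labeled.fillna(labeled.median(numeric_only=True))   # per-column median
```

In a real workflow the medians should come from the training split only, so the test set does not influence the imputation.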

---

### 🧪 3. Feature Engineering  
📌 Create, modify, or encode features  
🔧  
- `df['new'] = df['col1'] + df['col2']` ➕ – Create new feature  
- `pd.get_dummies(df, drop_first=True)` 🧯 – One-hot encoding  
- `df['col'].map({'A': 0, 'B': 1})` 🔁 – Label encoding  
- `df['text'].str.extract(r'(regex)')` 🧠 – Text feature extraction (the pattern needs at least one capture group)  
- `df['col'].apply(lambda x: x**2)` 🧪 – Feature transformation  
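
💡 A sketch of the feature-engineering calls on invented data. The `grade` and `text` columns and the `#(\d+)` pattern are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [10, 20, 30],
    "grade": ["A", "B", "A"],
    "text": ["order #12", "order #7", "no number"],
})

df["new"] = df["col1"] + df["col2"]
df["grade_code"] = df["grade"].map({"A": 0, "B": 1})     # values missing from the dict become NaN
df["order_no"] = df["text"].str.extract(r"#(\d+)", expand=False)  # no match -> NaN
df["col1_sq"] = df["col1"] ** 2                          # vectorized form of apply(lambda x: x**2)

encoded = pd.get_dummies(df[["grade"]], drop_first=True)  # drops the first level ('A')
```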

---

### 🏗 4. Data Splitting  
📌 Prepare train/test/validation sets  
🔧  
- `from sklearn.model_selection import train_test_split` ✂️  
- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)` 🔍 – Train-test split  
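
💡 `train_test_split` produces two parts per call, so a validation set takes a second call. A sketch assuming a 60/20/20 split on synthetic data, with `stratify` keeping the class ratio equal in each part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"f": range(100), "target": [0, 1] * 50})
X, y = df[["f"]], df["target"]

# Hold out 20% for testing, then 25% of the remainder for validation (0.25 * 0.8 = 0.2).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)
```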

---

### 🧮 5. Scaling & Normalization  
📌 Preprocess features before training  
🔧  
- `from sklearn.preprocessing import StandardScaler` 📊  
- `scaler = StandardScaler()`  
- `X_scaled = scaler.fit_transform(X)` 🧼 – Z-score scaling (fit on the training data only, then call `scaler.transform` on the test set)  

Other options:  
- `MinMaxScaler()` 📏 – Normalize to [0, 1]  
- `RobustScaler()` 🧱 – Handle outliers  
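
💡 A sketch of leak-free scaling on two tiny invented splits: the scaler learns its statistics from the training rows and reuses them on the test rows.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
X_test = pd.DataFrame({"a": [2.0, 4.0], "b": [20.0, 40.0]})

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)         # reuse those statistics, no refit
```

`MinMaxScaler` and `RobustScaler` follow the same `fit_transform` / `transform` pattern.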

---

### 📊 6. Correlation & Stats  
📌 Explore data relationships  
🔧  
- `df.corr(numeric_only=True)` 🔗 – Correlation matrix of the numeric columns  
- `df['target'].value_counts()` 🧮 – Class balance check  
- `df.groupby('label').mean(numeric_only=True)` 📐 – Mean of each numeric column per group  
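
💡 A sketch of a quick pre-modeling check on invented data: class shares, group means, and each feature's correlation with the target.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "label": ["neg", "neg", "pos", "pos"],
    "target": [0, 0, 1, 1],
})

class_share = df["target"].value_counts(normalize=True)      # fractions instead of counts
group_means = df.groupby("label").mean(numeric_only=True)    # numeric columns only
corr_with_target = df.corr(numeric_only=True)["target"]      # one column of the matrix
```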

---

### 🔍 7. Model Evaluation Helpers  
📌 Analyze results after predictions  
🔧  
- `df['pred_error'] = df['y_true'] - df['y_pred']` 🧾 – Error column  
- `df['correct'] = df['y_true'] == df['y_pred']` ✅ – Boolean match  
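- `df['prob'] = model.predict_proba(X)[:, 1]` 📈 – Probability of the second class in `model.classes_` (the positive class in binary problems)  
- `df['conf'] = np.max(model.predict_proba(X), axis=1)` 🧪 – Confidence of the predicted class (needs `import numpy as np`)  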
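
💡 A sketch that builds these helper columns end to end. `LogisticRegression` and the six-row dataset are stand-ins chosen for illustration; any classifier with `predict_proba` would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
y = pd.Series([0, 0, 0, 1, 1, 1], name="y_true")

model = LogisticRegression().fit(X, y)   # stand-in classifier
proba = model.predict_proba(X)           # shape (n_rows, n_classes)

df = X.copy()
df["y_true"] = y
df["y_pred"] = model.predict(X)
df["correct"] = df["y_true"] == df["y_pred"]
df["prob"] = proba[:, 1]                 # column order follows model.classes_
df["conf"] = np.max(proba, axis=1)       # probability of the predicted class
accuracy = df["correct"].mean()          # share of rows predicted correctly
```

Calling `predict_proba` once and reusing the result avoids scoring the data twice.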

---

### 📦 8. Export for Modeling  
📌 Save cleaned/preprocessed data  
🔧  
- `df.to_csv('cleaned_data.csv', index=False)` 💾  
- `df.to_pickle('df.pkl')` 🧺 – Preserves dtypes (only unpickle files you trust)  
- `df.to_parquet('df.parquet')` 🚀 – Columnar format for large datasets (needs `pyarrow` or `fastparquet`)  
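
💡 The practical difference between the formats is what happens to dtypes. A sketch comparing CSV and pickle in a temporary directory; Parquet is left out so the example has no optional dependencies.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["lo", "hi", "lo"]), "x": [1.5, 2.5, 3.5]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "cleaned_data.csv")
    pkl_path = os.path.join(tmp, "df.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)       # dtypes are re-inferred from text
    from_pkl = pd.read_pickle(pkl_path)    # dtypes come back as saved
```

After the CSV round trip the categorical column is no longer categorical, while the pickle copy is identical to the original.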

---

### ⚙️ Extra Tricks for ML Pipelines  
📌 Helpful utilities used in real ML workflows  
🔧  
- `df.select_dtypes(include='number')` 🔢 – Numeric columns only  
- `df.columns[df.isnull().any()]` 🚨 – Columns with NaNs  
- `df['cat'] = df['cat'].astype('category')` 🧬 – Convert to categorical  
- `df.sort_values(by='importance', ascending=False)` 📌 – Rank rows of a feature-importance table, highest first  
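
💡 A sketch of these tricks on a small invented frame. The `importances` table is hypothetical; in practice it might be built from a fitted model's `feature_importances_`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "cat": ["a", "b", "a"],
    "flag": [True, False, True],
})

numeric = df.select_dtypes(include="number")             # boolean columns are not selected
cols_with_nan = df.columns[df.isnull().any()].tolist()
df["cat"] = df["cat"].astype("category")
cat_codes = df["cat"].cat.codes                          # integer code per category

# Hypothetical importance table, sorted so the strongest feature comes first.
importances = pd.DataFrame({"feature": ["num", "cat", "flag"], "importance": [0.2, 0.7, 0.1]})
ranked = importances.sort_values(by="importance", ascending=False)
```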

---

# **🤖Scikit-learn + Pandas Integration Cheatsheet**

---

### 📋 1. Prepare Features & Target  
📌 Slice your DataFrame into X (features) and y (target)  
```python
X = df.drop('target', axis=1)  # 🎯 Features
y = df['target']               # 🏷 Target
```

---

### ✂️ 2. Train-Test Split  
📌 Split DataFrame for training and testing  
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

---

### 🧼 3. Use Pandas with Scalers  
📌 Keep column names even after scaling  
```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns,
    index=X.index,  # keep the original row labels so X_scaled still aligns with y
)
```

---

### 🏗 4. Build Pipelines with Pandas  
📌 Combine preprocessing + modeling (safe with DataFrames!)  
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
```

---

### 🧱 5. ColumnTransformer with Pandas  
📌 Apply different transforms to numeric vs categorical columns  
```python
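from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Pick the column groups by dtype
numeric_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(include='object').columns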

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

full_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

full_pipeline.fit(X_train, y_train)
```

---

### 📈 6. Predictions with Pandas Index  
📌 Return predictions with original row index  
```python
y_pred = pd.Series(pipeline.predict(X_test), index=X_test.index)
```

---

### 🧪 7. Cross-Validation with DataFrame  
📌 No need to convert to arrays – it just works  
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5)
```

---

### 🧾 8. Feature Importances with Columns  
📌 Get feature importance + original column names  
```python
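model = full_pipeline.named_steps['model']
# Take the names from the fitted preprocessing step so they match the model's inputs
features = full_pipeline.named_steps['preprocess'].get_feature_names_out()
importances = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)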
```

---

### 🧠 9. Export Model & DataFrame Together  
📌 Save full pipeline (with preprocessing)  
```python
import joblib
joblib.dump(full_pipeline, 'model_pipeline.pkl')  # 💾 Save full model
```
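
💡 A sketch of the full save-and-load round trip, assuming a small stand-in pipeline and a temporary directory. Because the scaler is saved inside the pipeline, the restored object accepts raw, unscaled DataFrames.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
y = pd.Series([0, 0, 1, 1])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),   # stand-in estimator for illustration
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)       # only load files from sources you trust

same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
```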

---

### ✅ 10. Clean Predictions with Original Data  
📌 Combine predictions with the original DataFrame  
```python
df_result = df.copy()
df_result['pred'] = pipeline.predict(X)
```

---