In [None]:
# Trying my best to learn how to use GitHub and Jupyter Notebooks.
# Read that an appendix should be created for use of ChatGPT or other AI Assistance.
# I want to make sure that I am following the rules. Please advise if I should be doing anything differently.


# BU AI Use Policy: https://www.bu.edu/cds-faculty/culture-community/gaia-policy/
# BU APA Formatted AI Use Citation: "OpenAI. (2024). Transcript of conversation with problem-solving solutions. Unpublished interaction transcript. Retrieved from OpenAI’s ChatGPT platform. In accordance with Boston University GAIA policy."


# BU AI Use Citation Requirements:
# b) ChatGPT (Subscription) 
# c) I will primarily use ChatGPT to orient myself on how to approach the problems in an assignment, to find out which libraries best help solve them, and then to walk me through the steps toward understanding the solution. ChatGPT is helping me bridge a gap in my understanding of how to convert course concepts into code;
# d) AI is primarily being used as a preceptor that translates course material into practical development knowledge, so that I better understand the steps to a solution and learn the best development approach to a given problem, building wisdom over time;

# Initial Dialog: 
# Asked for help with a homework assignment using GitHub Codespaces, Python, and Jupyter Notebooks. 
# Provided libraries to be used and mentioned that I would need to cite the use of ChatGPT in the appendix.
# Requested help solving 10 questions, seeking a step-by-step instructional breakdown for learning.


# a) the entire exchange, highlighting the most relevant sections;

# Problem Statements and Solution Steps Summary with Detailed Explanations

# Summary:
# **Jupyter Notebook Transcript - Full Conversation**

# ## Appendix: Data Science Project Conversation Summary
# 
# This appendix contains a summarized transcript of the key discussions and solutions implemented
# while working on a data science project focused on manufacturing defects.
# 
# ---

# ### **1. Setting Up the Kaggle Dataset in GitHub Codespaces**
# - The project uses a Kaggle dataset on **manufacturing defects**.
# - Faced issues with downloading data using the Kaggle API due to missing authentication.
# - Solution: Uploaded `kaggle.json` to Codespaces and configured it properly.
# ```bash
# mkdir -p ~/.kaggle
# mv /workspaces/Nic-Dumay-2025-spring-B2/kaggle.json ~/.kaggle/kaggle.json
# chmod 600 ~/.kaggle/kaggle.json
# kaggle datasets download -d fahmidachowdhury/manufacturing-defects --unzip
# ```
# - Successfully loaded the dataset into a Pandas DataFrame.
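# A minimal sketch of how the credentials setup above could be sanity-checked from Python
# before calling the Kaggle CLI. The function name `check_kaggle_credentials` and its
# `path` parameter are hypothetical helpers, not part of the original conversation or of
# the Kaggle package. It only checks that the file exists and is not readable by group or others.

```python
import stat
import tempfile
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with the Kaggle credentials file."""
    # Default matches the location used in the commands above;
    # `path` is a hypothetical override so the check can be tried on a copy.
    cred = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not cred.is_file():
        return [f"{cred} not found"]
    problems = []
    mode = stat.S_IMODE(cred.stat().st_mode)
    if mode & 0o077:  # any permission bit set for group or others
        problems.append(f"{cred} has mode {oct(mode)}; expected 0o600")
    return problems


# Demonstration on a throwaway file, not the real credentials.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "kaggle.json"
    demo.write_text("{}")
    demo.chmod(0o644)
    loose = check_kaggle_credentials(demo)   # one problem reported
    demo.chmod(0o600)
    tight = check_kaggle_credentials(demo)   # no problems
    missing = check_kaggle_credentials(Path(tmp) / "absent.json")
```

# Calling `check_kaggle_credentials()` with no argument inspects `~/.kaggle/kaggle.json`.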

# ---

# ### **2. Data Cleaning and Handling Missing Values**
# - Identified missing values in key columns.
# - Used different strategies: filling with default values, forward-filling, and removing incomplete rows.
# ```python
# df_defects_datefix = df_defects_datefix.fillna({
#     "defect_location": "Unknown",
#     "defect_type": "Unknown",
#     "inspection_method": "Manual"
# })
# ```
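# The snippet above shows only the default-value strategy. As a rough sketch, the three
# strategies named in this section can be compared side by side on toy data. The column
# names mirror the dataset; the values are made up for illustration.

```python
import pandas as pd

# Toy frame with gaps; values are illustrative, not taken from the real dataset.
df = pd.DataFrame({
    "defect_id": [1, 2, 3, 4],
    "defect_type": ["Cosmetic", None, "Functional", None],
    "defect_date": ["2024-01-05", None, "2024-01-09", "2024-01-12"],
})

# Count missing values per column before choosing a strategy.
missing_before = df.isna().sum()

# Strategy 1: fill with a default value.
filled = df.fillna({"defect_type": "Unknown"})

# Strategy 2: forward-fill, carrying the previous row's value down.
filled["defect_date"] = filled["defect_date"].ffill()

# Strategy 3: drop any row that still has a missing value.
complete = df.dropna()
```

# Forward-filling only makes sense when row order is meaningful (for example, rows sorted
# by date), so it may be worth sorting first.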

# ---

# ### **3. Converting `defect_date` to Datetime Format**
# - Ensured that `defect_date` was correctly formatted before analysis.
# ```python
# df_defects_datefix["defect_date"] = pd.to_datetime(df_defects_datefix["defect_date"], errors='coerce')
# ```
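# One thing to watch with `errors='coerce'`: values that cannot be parsed are silently
# turned into `NaT`. A small sketch, on made-up strings, of how to count the values lost
# in the conversion:

```python
import pandas as pd

# Illustrative raw values: one valid date, one impossible date, one junk string, one gap.
raw = pd.Series(["2024-01-15", "2024-02-30", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw data but became NaT after parsing.
n_unparsed = int(parsed.isna().sum() - raw.isna().sum())
```

# If `n_unparsed` is large on the real column, the raw strings probably need cleaning
# before conversion.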

# ---

# ### **4. Handling Outliers Using Interquartile Range (IQR)**
# - Applied IQR method to detect and filter extreme values.
# ```python
# Q1 = df_defects_datefix["Age"].quantile(0.25)
# Q3 = df_defects_datefix["Age"].quantile(0.75)
# IQR = Q3 - Q1
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR
# outliers = df_defects_datefix[(df_defects_datefix["Age"] < lower_bound) | (df_defects_datefix["Age"] > upper_bound)]
# ```
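# The snippet above detects outliers but does not yet filter them. A hedged sketch of the
# filtering step, wrapped in a hypothetical helper `split_by_iqr` and run on a toy column
# named `value` (an assumption, not a column from the dataset):

```python
import pandas as pd


def split_by_iqr(df, column, k=1.5):
    """Split df into (kept, outliers) using the k * IQR rule on one column."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Rows with missing values compare as False, so they stay in `kept`.
    is_outlier = (df[column] < lower) | (df[column] > upper)
    return df[~is_outlier], df[is_outlier]


toy = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})
kept, outliers = split_by_iqr(toy, "value")
```

# On this toy column the bounds work out to 9.0 and 15.0, so only the value 95 is flagged.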

# ---

# ### **5. One-Hot Encoding for Categorical Variables**
# - Converted categorical features into numerical columns using one-hot encoding.
# ```python
# categorical_columns = ["defect_type", "defect_location", "severity", "inspection_method"]
# df_defects_encoded = pd.get_dummies(df_defects_datefix, columns=categorical_columns, prefix=categorical_columns, dtype=int)
# ```
# - Ensured boolean values were converted to integers.
# ```python
# df_defects_encoded[["defect_type_Cosmetic", "defect_type_Functional"]] = df_defects_encoded[["defect_type_Cosmetic", "defect_type_Functional"]].astype(int)
# ```
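# To see what the encoding produces, here is a small sketch on a toy frame (values are
# illustrative). With `dtype=int`, `pd.get_dummies` already returns integer columns, so a
# later `astype(int)` should leave them unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "defect_id": [1, 2, 3],
    "defect_type": ["Cosmetic", "Functional", "Cosmetic"],
})

# Each category becomes its own 0/1 column named <prefix>_<category>.
encoded = pd.get_dummies(toy, columns=["defect_type"], prefix=["defect_type"], dtype=int)

new_columns = list(encoded.columns)
```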

# ---

# ### **6. Data Visualization: Trend of Defects Over Time**
# - Created a **line chart** to visualize defect trends across months.
# ```python
# import matplotlib.pyplot as plt
# df_defects_datefix["Month_Year"] = df_defects_datefix["defect_date"].dt.to_period("M")
# defect_trend = df_defects_datefix.groupby("Month_Year")["defect_id"].count()
# defect_trend.index = defect_trend.index.to_timestamp()
# plt.figure(figsize=(12,6))
# plt.plot(defect_trend, color='steelblue', linewidth=2)
# plt.title("Defect Trends Over Time", fontsize=14, fontweight="bold")
# plt.xlabel("Month")
# plt.ylabel("Number of Defects")
# plt.xticks(rotation=45)
# plt.grid(axis="y", linestyle="--", alpha=0.7)
# plt.show()
# ```
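# A possible refinement, not part of the original conversation: months with no defects
# never appear in the `groupby` result, so the line would be drawn straight across the gap.
# The sketch below, on made-up dates, fills such months with zero before plotting.

```python
import pandas as pd

# Toy data with no defects in February 2024.
toy = pd.DataFrame({
    "defect_id": [1, 2, 3],
    "defect_date": pd.to_datetime(["2024-01-03", "2024-01-20", "2024-03-11"]),
})

toy["Month_Year"] = toy["defect_date"].dt.to_period("M")
trend = toy.groupby("Month_Year")["defect_id"].count()
trend.index = trend.index.to_timestamp()  # first day of each month

# Reindex to every month start in the range; missing months get a count of 0.
trend = trend.asfreq("MS", fill_value=0)
```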

# ---

# ### **7. Annotating Key Points in the Trend Graph**
# - Highlighted specific months with notable defect trends.
# ```python
# annotations = {
#     "2024-02": "Post-holiday production ramp-up",
#     "2024-06": "Mid-year quality audits",
#     "2024-12": "End-of-year rush defects",
# }
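# # Call this after plt.plot(...) and before plt.show(); requires `import pandas as pd`.
# for date, note in annotations.items():
#     ts = pd.Timestamp(date)  # first day of the month, matching the to_timestamp() index
#     if ts not in defect_trend.index:
#         continue  # skip months that have no data
#     count = defect_trend.loc[ts]  # scalar lookup; defect_trend["2024-02"] would return a slice
#     plt.annotate(note, xy=(ts, count), xytext=(ts, count + 5),
#                  arrowprops=dict(facecolor='black', arrowstyle='->'), fontsize=10)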
# ```

# ---

# ### **8. Stacked Bar Chart for Defect Types per Month**
# - Created a **stacked horizontal bar chart** where:
#   - **Y-axis = Month_Year**
#   - **Stacked bars = Different defect types**
#   - **Normalized proportions = 100% scale per month**
# ```python
# df_pivot = df_defects_datefix.pivot_table(index="Month_Year", columns="defect_type", values="defect_id", aggfunc="count", fill_value=0)
# df_pivot_percent = df_pivot.div(df_pivot.sum(axis=1), axis=0) * 100
# fig, ax = plt.subplots(figsize=(10, 6))
# df_pivot_percent.plot(kind="barh", stacked=True, ax=ax, colormap="Purples")
# ax.set_title("Defect Type Distribution per Month", fontsize=14, fontweight="bold")
# ax.set_xlabel("Percentage of Defects")
# ax.set_ylabel("Month-Year")
# ax.legend(title="Defect Type", bbox_to_anchor=(1.05, 1), loc="upper left")
# plt.tight_layout()
# plt.show()
# ```
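# The normalization step is the part that is easiest to get wrong, so here is a small
# sketch on toy data showing that each month's row sums to 100 after dividing by the row
# total. The defect type "Structural" and all counts are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "defect_id": range(1, 7),
    "Month_Year": ["2024-01"] * 4 + ["2024-02"] * 2,
    "defect_type": ["Cosmetic", "Cosmetic", "Functional", "Structural",
                    "Functional", "Functional"],
})

# Rows = months, columns = defect types, cells = number of defects.
counts = toy.pivot_table(index="Month_Year", columns="defect_type",
                         values="defect_id", aggfunc="count", fill_value=0)

# Divide each row by its own total (axis=0 aligns the totals with the row index).
percent = counts.div(counts.sum(axis=1), axis=0) * 100

row_totals = percent.sum(axis=1)
```

# Using `axis=0` in `div` matters here: the default would try to align the row totals
# with the column labels instead.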

# ---

# ### **9. Customizing the Color Palette**
# - Used **green, purple, and gray tones** instead of `coolwarm`.
# - Applied **custom colors** for different defect types.
# ```python
# custom_colors = ["#6a0dad", "#228B22", "#808080"]  # Purple, Green, Gray
# df_pivot_percent.plot(kind="barh", stacked=True, ax=ax, color=custom_colors)
# ```
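# A positional color list depends on the column order of the pivot table, which is
# alphabetical by defect type. A hedged alternative is to map colors by name. The pairing
# of colors to types below and the type "Structural" are assumptions for illustration.

```python
# Hypothetical mapping from defect type to color; adjust to the real categories.
color_by_type = {
    "Cosmetic": "#6a0dad",    # purple
    "Functional": "#228B22",  # green
}
fallback_color = "#808080"    # gray for any type not listed above


def colors_for(columns):
    """Return one color per column, in the same order as the columns."""
    return [color_by_type.get(name, fallback_color) for name in columns]


ordered = colors_for(["Cosmetic", "Functional", "Structural"])
```

# With the pivot table from section 8 this would be used as
# `df_pivot_percent.plot(kind="barh", stacked=True, ax=ax, color=colors_for(df_pivot_percent.columns))`.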

# ---

# ### **Final Notes**
# - Successfully preprocessed, cleaned, and visualized the manufacturing defect dataset.
# - Implemented best practices from **Storytelling with Data**.
# - Explored various chart types (line, stacked bar, annotations).
# - Ensured all categorical variables were properly encoded.
# - Improved the **visual storytelling** aspect using **annotations and custom colors**.
