
# Data Understanding


## 2.1 Data Overview
- The dataset contains **858 records** and **36 features**.
- By data type, **10 features** are **int** and **26 features** are **object** (including numerical data stored as strings).
- The dataset contains information on demographic, behavioral, and medical history variables relevant to cervical cancer risk, including:
  - Age
  - Sexual activity (e.g., number of sexual partners, age at first intercourse)
  - Smoking habits
  - Use of contraceptives and IUDs
  - STDs history
  - Diagnosis outcomes (e.g., cancer, HPV, CIN)



## 2.2 Data Source
The data comes from the UCI Machine Learning Repository: `https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors`.
The dataset file is named `risk_factors_cervical_cancer.csv`.

In [6]:
# importing the necessary libraries to read the dataset
import pandas as pd
import numpy as np

In [11]:
# loading the dataset
df = pd.read_csv('risk_factors_cervical_cancer.csv')

# checking the first records
df.head(30)

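| | Age | Number of sexual partners | First sexual intercourse | Num of pregnancies | Smokes | Smokes (years) | Smokes (packs/year) | Hormonal Contraceptives | Hormonal Contraceptives (years) | IUD | ... | STDs: Time since first diagnosis | STDs: Time since last diagnosis | Dx:Cancer | Dx:CIN | Dx:HPV | Dx | Hinselmann | Schiller | Citology | Biopsy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 4.0 | 15.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 15 | 1.0 | 14.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 34 | 1.0 | ? | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 52 | 5.0 | 16.0 | 4.0 | 1.0 | 37.0 | 37.0 | 1.0 | 3.0 | 0.0 | ... | ? | ? | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 46 | 3.0 | 21.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 15.0 | 0.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 42 | 3.0 | 23.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 51 | 3.0 | 17.0 | 6.0 | 1.0 | 34.0 | 3.4 | 0.0 | 0.0 | 1.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 7 | 26 | 1.0 | 26.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 45 | 1.0 | 20.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | ? | ? | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 9 | 44 | 3.0 | 15.0 | ? | 1.0 | 1.266972909 | 2.8 | 0.0 | 0.0 | ? | ... | ? | ? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |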


## 2.3 Data Description
### Numeric Columns (e.g., Age, diagnosis fields)
- `Age`: ranges from 13 to 84, mean ≈ 26.8
- Diagnosis labels such as `Dx:Cancer`, `Dx:CIN`, and `Biopsy` are binary (0 or 1).

### Object Columns (contain many numeric-looking values)
- Some contain `?`, indicating missing or unknown values.
- Examples:
  - `Number of sexual partners`
  - `Smokes (years)`
  - `Hormonal Contraceptives (years)`

These should be cleaned and converted to numeric types.
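
A minimal sketch of that cleanup, assuming `?` is the only missing-value marker in the file. The `sample` frame and the `clean_numeric` helper are hypothetical stand-ins for `df` and are not part of the original notebook; `replace`, `select_dtypes`, and `pd.to_numeric` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the raw file: numbers stored as strings, "?" for unknowns.
sample = pd.DataFrame({
    "Age": [18, 15, 34],
    "Number of sexual partners": ["4.0", "1.0", "?"],
    "Smokes (years)": ["0.0", "?", "37.0"],
})


def clean_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with '?' replaced by NaN and text columns cast to numeric."""
    cleaned = frame.replace("?", np.nan)
    # Every column that is not already numeric is treated as numeric-looking text.
    text_cols = cleaned.select_dtypes(exclude="number").columns
    # Default errors="raise": an unexpected token fails loudly instead of becoming NaN.
    cleaned[text_cols] = cleaned[text_cols].apply(pd.to_numeric)
    return cleaned


cleaned = clean_numeric(sample)
print(cleaned.dtypes)
```

On the full dataset the same helper could be called as `clean_numeric(df)`. If other placeholder tokens turn up, the call would raise, which is a prompt to inspect them before deciding how to encode them.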

## 2.4 Statistical Summary

In [8]:
# checking the dataset info and their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 858 entries, 0 to 857
Data columns (total 36 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Age                                 858 non-null    int64 
 1   Number of sexual partners           858 non-null    object
 2   First sexual intercourse            858 non-null    object
 3   Num of pregnancies                  858 non-null    object
 4   Smokes                              858 non-null    object
 5   Smokes (years)                      858 non-null    object
 6   Smokes (packs/year)                 858 non-null    object
 7   Hormonal Contraceptives             858 non-null    object
 8   Hormonal Contraceptives (years)     858 non-null    object
 9   IUD                                 858 non-null    object
 10  IUD (years)                         858 non-null    object
 11  STDs                                858 non-null    object

In [9]:
# checking the shape of the dataset
df.shape

(858, 36)

In [10]:
# checking the dataset description
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 858.0 | 26.820513 | 8.497948 | 13.0 | 20.0 | 25.0 | 32.0 | 84.0 |
| STDs: Number of diagnosis | 858.0 | 0.087413 | 0.302545 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| Dx:Cancer | 858.0 | 0.020979 | 0.143398 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Dx:CIN | 858.0 | 0.01049 | 0.101939 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Dx:HPV | 858.0 | 0.020979 | 0.143398 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Dx | 858.0 | 0.027972 | 0.164989 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Hinselmann | 858.0 | 0.040793 | 0.197925 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Schiller | 858.0 | 0.086247 | 0.280892 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Citology | 858.0 | 0.051282 | 0.220701 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Biopsy | 858.0 | 0.064103 | 0.245078 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## 2.5 Data Quality Assessment
Light exploration shows that many fields (e.g., `Smokes`, `Hormonal Contraceptives`, `STDs:HPV`) are binary yes/no features. The diagnosis labels are imbalanced: only about 2% of records are positive for `Dx:Cancer`.

- **Missing Data**: Present as `?`; these entries need to be converted to `NaN`.
- **Incorrect Data Types**: Numeric values stored as `object`.
- **Imbalanced Classes**: Target labels are sparse (e.g., `Biopsy` = 1 in only about 6.4% of records).
- **Potential Outliers**: Some very high values (e.g., `Smokes (years)` = 37).
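
The imbalance and outlier points above can be put into numbers with a few lines of pandas. This is only a sketch: the `toy` frame holds made-up values, and Tukey's 1.5 × IQR rule is one common screening heuristic chosen here for illustration, not a method taken from the notebook.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "Smokes (years)": [0.0, 0.0, 1.0, 2.0, 0.0, 3.0, 37.0, 0.0],
    "Biopsy": [0, 0, 0, 0, 0, 0, 1, 0],
})

# For a 0/1 label, the mean is the share of positive records.
positive_rate = toy["Biopsy"].mean()

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
q1, q3 = toy["Smokes (years)"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy.loc[toy["Smokes (years)"] > upper_fence, "Smokes (years)"]

print(f"positive rate: {positive_rate:.3f}")
print(f"upper fence: {upper_fence:.3f}")
print(outliers)
```

When most values in a column are zero, the IQR can collapse to zero and the rule then flags every non-zero entry, so flagged rows are better treated as candidates for review than as errors.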

## 2.6 Next Steps
- Convert `?` to `NaN` and cast appropriate columns to **float**.
- Analyze missing data proportions.
- Consider imputation or row/column removal depending on sparsity.
- Visualize key distributions and relationships (e.g., `Age` vs `Biopsy` outcome).
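
As a rough sketch of the missing-data steps, assuming `?` has already been converted to `NaN`: the `toy` frame, the `MAX_MISSING` cut-off of 0.5, and the use of median imputation are all illustrative assumptions, not choices made in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Age": [18, 15, 34, 52],
    "Num of pregnancies": [1.0, 1.0, np.nan, 4.0],
    "STDs: Time since first diagnosis": [np.nan, np.nan, np.nan, 2.0],
})

# Share of missing values per column, most sparse first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off: drop columns with more than half their values missing.
MAX_MISSING = 0.5
sparse_cols = missing_share[missing_share > MAX_MISSING].index.tolist()
reduced = toy.drop(columns=sparse_cols)

# One simple option for the remaining gaps: fill with each column's median.
imputed = reduced.fillna(reduced.median())

print(missing_share)
print(imputed)
```

The threshold trades information loss against noise from heavily imputed columns, so it is worth checking `missing_share` on the real data before fixing a value.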

# dropping columns with high unknowns
