# **Programming for Data Science Final Project**

**Student Information:**

StudentID|Full Name
-|-
21127012|Tran Huy Ban
21127143|Nguyen Minh Quan


## **Table of contents**

[Overview](#overview)

[1. Data Collection](#collect)

[2. Data Exploring and Preprocessing](#process)

[References](#references)

## **Overview** <a name="overview"></a>

<center>
<h3>
    <b>
    Lung Cancer Prediction: Understanding Lung Cancer and Staying Healthy
    </b>
</h3>
    <img style="padding:10px" src="https://www.narayanahealth.org/sites/default/files/pillar-page/lung-cancer-banner-bg.jpg" width="800"/>
</center>
Lung cancer is a common and familiar disease. It is usually associated with air pollution or, more often, with the harmful effects of smoking. As air pollution rises, respiratory diseases are becoming more prevalent, which may make lung cancer more likely to develop. Lung cancer also presents with a variety of symptoms. Using this dataset of 1000 survey records, we explore those symptoms to understand the disease better and to learn how to protect our health.




### **Libraries used**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **1. Data Collection** <a name="collect"></a>

- **Data about**: factors that may be related to lung cancer, together with the severity level of the disease.

- **Source**: Kaggle

- **Data collection and license**: The data covers 1000 people drawn from a study of over 462,000 people in China who were followed for an average of six years. The study was published in the journal Nature Medicine. Participants were divided into two groups: those living in areas with high levels of air pollution and those living in areas with low levels. The researchers found that people in the high-pollution group were more likely to develop lung cancer than those in the low-pollution group. They also found that the risk was higher in nonsmokers than in smokers, and that it increased with age. The study does not prove that air pollution causes lung cancer, but it suggests a possible link between the two. More research is needed to confirm these findings and to determine how different types and levels of air pollution affect lung cancer risk.

In [3]:
df = pd.read_csv('Data/lung cancer.csv')
df.head()

Unnamed: 0,index,Patient Id,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,...,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
0,0,P1,33,1,2,4,5,4,3,2,...,3,4,2,2,3,1,2,3,4,Low
1,1,P10,17,1,3,1,5,3,4,2,...,1,3,7,8,6,2,1,7,2,Medium
2,2,P100,35,1,4,5,6,5,5,4,...,8,7,9,2,1,4,6,7,2,High
3,3,P1000,37,1,7,7,7,7,6,7,...,4,2,3,1,4,5,6,7,5,High
4,4,P101,46,1,6,8,7,7,7,6,...,3,2,4,1,4,2,4,2,3,High


## **2. Data Exploring and Preprocessing** <a name="process"></a>

#### First, let us find out what the dataset contains.

In [4]:
df.shape

(1000, 26)

So the dataset has 1000 rows and 26 columns.

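#### What is the meaning of each row?

Each row holds the survey record of one patient, identified by `Patient Id`: the patient's age and gender, ratings for risk factors and symptoms, and the resulting `Level`.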

#### Does the data have duplicate rows?


In [5]:
duplicate_rows = df[df.duplicated()]
if duplicate_rows.empty:
    print('No duplicate rows found')
else:
    print('Duplicate rows found')
    print(duplicate_rows)


No duplicate rows found
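One caveat about this check: a full-row comparison can never flag two rows whose `index` or `Patient Id` differ, even if every other value is identical. The sketch below compares rows on the remaining columns only. The `sample` frame is made-up data for illustration, and treating `index` and `Patient Id` as pure identifiers is an assumption.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
sample = pd.DataFrame({
    "index": [0, 1, 2],
    "Patient Id": ["P1", "P10", "P100"],
    "Age": [33, 17, 33],
    "Smoking": [3, 2, 3],
    "Level": ["Low", "Medium", "Low"],
})

# Full-row check: the identifiers are unique, so nothing is flagged.
full_row_dupes = int(sample.duplicated().sum())

# Check on the feature columns only (identifier columns dropped).
id_cols = ["index", "Patient Id"]
feature_dupes = int(sample.drop(columns=id_cols).duplicated().sum())

print(full_row_dupes, feature_dupes)  # 0 1
```

Whether repeated feature rows should be removed depends on the analysis; two different patients can legitimately give identical answers.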


#### What is the meaning of each column?

In [6]:
df.columns


Index(['index', 'Patient Id', 'Age', 'Gender', 'Air Pollution', 'Alcohol use',
       'Dust Allergy', 'OccuPational Hazards', 'Genetic Risk',
       'chronic Lung Disease', 'Balanced Diet', 'Obesity', 'Smoking',
       'Passive Smoker', 'Chest Pain', 'Coughing of Blood', 'Fatigue',
       'Weight Loss', 'Shortness of Breath', 'Wheezing',
       'Swallowing Difficulty', 'Clubbing of Finger Nails', 'Frequent Cold',
       'Dry Cough', 'Snoring', 'Level'],
      dtype='object')

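No data dictionary is included in this notebook, so the descriptions below are read off the column names and the values shown in this section:

- `index`: row number.
- `Patient Id`: patient identifier.
- `Age`: age of the patient.
- `Gender`: gender of the patient, stored as an integer code.
- `Air Pollution` through `Passive Smoker`: integer ratings of risk factors.
- `Chest Pain` through `Snoring`: integer ratings of symptoms.
- `Level`: severity level of lung cancer (`Low`, `Medium`, or `High`).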
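The column names above mix capitalization styles (`OccuPational Hazards`, `chronic Lung Disease`, `Alcohol use`), which makes them easy to mistype later. A minimal sketch of one way to normalize them follows. The snake_case target style is an assumption chosen for illustration, not something the notebook itself applies, and the short `cols` index stands in for `df.columns`.

```python
import pandas as pd

# A few of the original column names, standing in for df.columns.
cols = pd.Index(["Patient Id", "OccuPational Hazards", "chronic Lung Disease", "Alcohol use"])

# Strip surrounding whitespace, lowercase, and replace spaces with underscores.
clean = cols.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(list(clean))
```

On the real frame the same expression could be assigned back with `df.columns = ...`. Any later cell that refers to the original names would then need updating as well.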

#### What is the current data type of each column? Are there any columns with inappropriate data types?

In [7]:
df.dtypes

index                        int64
Patient Id                  object
Age                          int64
Gender                       int64
Air Pollution                int64
Alcohol use                  int64
Dust Allergy                 int64
OccuPational Hazards         int64
Genetic Risk                 int64
chronic Lung Disease         int64
Balanced Diet                int64
Obesity                      int64
Smoking                      int64
Passive Smoker               int64
Chest Pain                   int64
Coughing of Blood            int64
Fatigue                      int64
Weight Loss                  int64
Shortness of Breath          int64
Wheezing                     int64
Swallowing Difficulty        int64
Clubbing of Finger Nails     int64
Frequent Cold                int64
Dry Cough                    int64
Snoring                      int64
Level                       object
dtype: object

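All columns are `int64` except `Patient Id` and `Level`, which are `object` (strings). These types match the content of the columns, so no conversion is required. Since `Level` holds only a few distinct labels, it could optionally be stored as a categorical type.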

#### For each numerical column, how are the values distributed? What is the percentage of missing values? What are the min and max? Are they abnormal?


In [8]:
numeric_columns = df.select_dtypes(include=['number'])
numeric_columns.describe()

Unnamed: 0,index,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,Balanced Diet,...,Coughing of Blood,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,499.5,37.174,1.402,3.84,4.563,5.165,4.84,4.58,4.38,4.491,...,4.859,3.856,3.855,4.24,3.777,3.746,3.923,3.536,3.853,2.926
std,288.819436,12.005493,0.490547,2.0304,2.620477,1.980833,2.107805,2.126999,1.848518,2.135528,...,2.427965,2.244616,2.206546,2.285087,2.041921,2.270383,2.388048,1.832502,2.039007,1.474686
min,0.0,14.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,249.75,27.75,1.0,2.0,2.0,4.0,3.0,2.0,3.0,2.0,...,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
50%,499.5,36.0,1.0,3.0,5.0,6.0,5.0,5.0,4.0,4.0,...,4.0,3.0,3.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0
75%,749.25,45.0,2.0,6.0,7.0,7.0,7.0,7.0,6.0,7.0,...,7.0,5.0,6.0,6.0,5.0,5.0,5.0,5.0,6.0,4.0
max,999.0,73.0,2.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,...,9.0,9.0,8.0,9.0,8.0,8.0,9.0,7.0,7.0,7.0
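The `count` row equals 1000 for every column displayed above, which suggests those columns have no missing values, but `describe()` does not report the missing percentage directly. A minimal sketch of computing it next to the min and max is shown here. The `sample` frame is made-up data (with one missing value added on purpose), not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy numeric frame; values are illustrative, with one NaN added on purpose.
sample = pd.DataFrame({
    "Age": [33, 17, 35, np.nan],
    "Smoking": [3, 2, 2, 7],
})

# One row per column: percentage of missing values, minimum, and maximum.
summary = pd.DataFrame({
    "missing_pct": sample.isna().mean() * 100,
    "min": sample.min(),
    "max": sample.max(),
})

print(summary)
```

Applied to the notebook's data, `sample` would be replaced by `df.select_dtypes(include='number')`.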


#### How are values distributed in each categorical column?

In [11]:
# Select the categorical (object-typed) columns
cat_col_info_df = df.select_dtypes(include='object')

def missing_ratio(s):
    # Percentage of missing values in the column
    return round(s.isna().mean() * 100, 1)

def num_values(s):
    # Number of distinct values; cells holding several ';'-separated values are split first
    return s.str.split(';').explode().nunique()

def value_ratios(s):
    # Percentage of each distinct value among the non-missing entries
    s = s.str.split(';').explode()
    return (s.value_counts(normalize=True) * 100).round(1).to_dict()

# One row per statistic, one column per categorical column
cat_col_info_df = cat_col_info_df.agg([missing_ratio, num_values, value_ratios])
cat_col_info_df

Unnamed: 0,Patient Id,Level
missing_ratio,0.0,0.0
num_values,1000,3
value_ratios,"{'P1': 0.1, 'P702': 0.1, 'P691': 0.1, 'P692': ...","{'High': 36.5, 'Medium': 33.2, 'Low': 30.3}"
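For a single column such as `Level`, the same percentages can be obtained without helper functions. The sketch below uses `value_counts(normalize=True)` on a small made-up series that stands in for `df['Level']`, so its numbers are illustrative and differ from the real output above.

```python
import pandas as pd

# Toy stand-in for df['Level']; the values are illustrative only.
level = pd.Series(["High", "High", "Medium", "Low"], name="Level")

# normalize=True returns fractions of the non-missing entries; scale to percent.
ratios = (level.value_counts(normalize=True) * 100).round(1).to_dict()

print(ratios)
```

Near-equal shares across the three labels, as in the real output, indicate that the classes are reasonably balanced.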
