
# **Python**

This project analyzes the Lung Cancer Data Set to examine how smoking relates to lung cancer. It looks at how that relationship varies across age groups, identifies the age group with the highest smoking rates, and uses visualization to show how smoking behavior connects to health outcomes.

In [5]:
import pandas as pd

In [6]:
df=pd.read_csv("/content/lung_cancer_examples(1).csv")
df.head()

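| Unnamed: 0 | Name | Surname | Age | Smokes | AreaQ | Alkhol | Result |
|---|---|---|---|---|---|---|---|
| 0 | John | Wick | 35 | 3 | 5 | 4 | 1 |
| 1 | John | Constantine | 27 | 20 | 2 | 5 | 1 |
| 2 | Camela | Anderson | 30 | 0 | 5 | 2 | 0 |
| 3 | Alex | Telles | 28 | 0 | 8 | 1 | 0 |
| 4 | Diego | Maradona | 68 | 4 | 5 | 6 | 1 |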
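
The age-group comparison described in the introduction could be sketched with `pd.cut`. The bin edges, the labels, and the `Age_Group` column are assumptions chosen for illustration (the notebook does not define age groups), and the five rows are the ones printed by `df.head()`, retyped so the sketch runs on its own.

```python
import pandas as pd

# The five rows printed by df.head(), retyped so the sketch is self-contained.
sample = pd.DataFrame({
    "Age": [35, 27, 30, 28, 68],
    "Smokes": [3, 20, 0, 0, 4],
    "Result": [1, 1, 0, 0, 1],
})

# Hypothetical age bands; adjust the edges to suit the analysis.
bins = [0, 29, 49, 120]
labels = ["Under 30", "30-49", "50+"]
sample["Age_Group"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Mean 'Smokes' value and share of positive results per age band.
summary = sample.groupby("Age_Group", observed=True).agg(
    mean_smokes=("Smokes", "mean"),
    cancer_rate=("Result", "mean"),
)
print(summary)
```

With only five rows the numbers mean little; the same grouping applied to the full dataset would show which band smokes the most.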


In [11]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


def read_csv():
    # Method to read the CSV file "lung_cancer_examples.csv" using pandas.
    df = pd.read_csv("lung_cancer_examples.csv")
    return df


def check_duplicates():
    # Method to check for duplicate rows in the DataFrame.
    # Returns: The number of duplicated rows found in the DataFrame.
    df = read_csv()
    duplicate_count = df.duplicated().sum()
    return duplicate_count


def check_null_values():
    # Method to check for null (missing) values in the DataFrame.
    # Returns: A pandas Series indicating the count of null values for each column in the DataFrame.
    df = read_csv()
    null_counts = df.isnull().sum()
    return null_counts


def rename_column():
    # do not edit the predefined function name
    df = read_csv()
    # Rename columns 'Alkhol' to 'Alcohol'.
    df.rename(columns={"Alkhol":"Alcohol"},inplace=True)
    return df


def check_smoke_value():
    # do not edit the predefined function name
    data = rename_column()

    # Count the occurrences of each unique value in the 'Smokes' column
    smoke_counts = data['Smokes'].value_counts()
    # Return the counts of each unique smoking habit value
    return smoke_counts
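
To see what the checks in this cell return without needing the CSV file, the same pandas calls can be tried on a tiny frame. The three rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical, row 2 has no age.
demo = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bob"],
    "Age": [40, 40, None],
    "Alkhol": [1, 1, 3],
})

duplicate_count = int(demo.duplicated().sum())  # rows that repeat an earlier row
null_counts = demo.isnull().sum()               # missing values per column
renamed = demo.rename(columns={"Alkhol": "Alcohol"})

print(duplicate_count)
print(null_counts)
print(list(renamed.columns))
```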



In [None]:
import module1 as t1
import pandas as pd




# Function to categorize individuals based on the number of cigarettes smoked per day
def categorize_smokers(x):
    # If x is 0, categorize the person as 'Non-Smokers'.
    if x == 0:
        return "Non-Smokers"
    # If x is less than or equal to 2, categorize the person as 'Light Smokers'.
    elif x <= 2:
        return "Light Smokers"
    # If x is greater than 2 and less than or equal to 10, categorize the person as 'Mediocre Smokers'.
    elif x <= 10:
        return "Mediocre Smokers"
    # If x is greater than 10, categorize the person as 'Heavy Smokers'.
    else:
        return "Heavy Smokers"



# Function to process the smoking data and add a new 'Smoking_Category' column
def smokes():
    # do not edit the predefined function name
    data = t1.rename_column()

    # Applying the 'categorize_smokers' function to each value in the 'Smokes' column
    data["Smoking_Category"]=data["Smokes"].map(categorize_smokers)
    # and storing the result in a new column 'Smoking_Category'

    # Returning the modified dataset with the new 'Smoking_Category' column
    return data




def check_alcohol_value():
    # do not edit the predefined function name
    data = smokes()

    # Count the occurrences of each unique value in the 'Alcohol' column
    alcohol_counts = data["Alcohol"].value_counts()
    # Return the counts of each unique alcohol consumption value
    return alcohol_counts







# Function to categorize individuals based on the number of alcohol drinks consumed per day
def categorize_alcohol(x):
    # If x is 0, categorize the person as 'Non-Drinkers'.
    if x == 0:
        return "Non-Drinkers"
    # If x is less than or equal to 2, categorize the person as 'Light Drinkers'.
    elif x <= 2:
        return "Light Drinkers"
    # If x is greater than 2 and less than or equal to 10, categorize the person as 'Mediocre Drinkers'.
    elif x <= 10:
        return "Mediocre Drinkers"
    # If x is greater than 10, categorize the person as 'Heavy Drinkers'.
    else:
        return "Heavy Drinkers"


# Function to process the alcohol data and add a new 'Alcohol_Category' column
def alkhol():
    # Assuming the 'smokes()' function retrieves the dataset with the 'Smokes' column and the 'Alcohol' column
    data = smokes()
    # Applying the 'categorize_alcohol' function to each value in the 'Alcohol' column
    # and storing the result in a new column 'Alcohol_Category'
    data["Alcohol_Category"]=data["Alcohol"].map(categorize_alcohol)

    # Returning the modified dataset with the new 'Alcohol_Category' column
    return data




def export_the_dataset():
    # do not edit the predefined function name
    df = alkhol()
    # write your code to export the cleaned dataset and set the index=false and return the same as 'df'
    df.to_csv('lung_cancer.csv', index=False)
    return df
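
As an alternative to mapping a Python function over each value, the same thresholds used by `categorize_smokers` can be expressed as bins for `pd.cut`. This is a sketch rather than the notebook's method, and the sample values are made up. One difference to be aware of: negative inputs land in 'Non-Smokers' here, whereas `categorize_smokers` would label them 'Light Smokers'.

```python
import pandas as pd

# Hypothetical 'Smokes' values covering each threshold boundary.
smokes = pd.Series([0, 1, 2, 3, 10, 11, 20])

# Right-closed intervals: (-inf, 0], (0, 2], (2, 10], (10, inf)
categories = pd.cut(
    smokes,
    bins=[float("-inf"), 0, 2, 10, float("inf")],
    labels=["Non-Smokers", "Light Smokers", "Mediocre Smokers", "Heavy Smokers"],
).astype(str)

print(categories.tolist())
```

The same bins with the drinker labels would reproduce `categorize_alcohol`.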

# **MYSQL**

How many rows are there in the given dataset?

In [None]:
select count(*) from lung_cancer;

Select the average age of individuals in the given dataset

In [None]:
SELECT Avg(Age) as AVG_AGE from lung_cancer

Select the total count of 'Smokers' in the given dataset

In [None]:
SELECT count(*) FROM lung_cancer WHERE Smoking_Category != 'Non-Smokers'

Select the 'Name', 'Age', and 'Alcohol Category' columns for 'Mediocre Drinkers'.

In [None]:
SELECT Name,Age,Alcohol_Category from lung_cancer
where Alcohol_Category='Mediocre Drinkers'
group by Name,Age,Alcohol_Category

Select the 'Name' and 'Age' of the oldest individual in the given dataset.

In [None]:
SELECT Name, Age FROM lung_cancer WHERE Age = (SELECT MAX(Age) FROM lung_cancer)

Select the 'Name' and 'Surname' of individuals whose names start with 'A'.

In [None]:
SELECT Name, Surname FROM lung_cancer WHERE Name LIKE 'A%'

Select the 'Name', 'Age', and 'Alcohol' columns for individuals who are both 'Heavy Smokers' and 'Mediocre Drinkers'.

In [None]:
select Name,Age,Alcohol from lung_cancer where Smoking_Category= 'Heavy Smokers' and Alcohol_Category= 'Mediocre Drinkers' group by Name,Age,Alcohol

Find out the percentage of lung cancer for individuals whose age is greater than 18.

Hint: Result column specifies the cancer detection.

Note : Use 100.0 while finding the percentage.

In [None]:
SELECT Result, count(*) AS total, count(*) * 100.0 / sum(count(*)) OVER () AS percentage FROM `lung_cancer` WHERE Age > 18 GROUP BY Result
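
The percentage query can be tried without a MySQL server using Python's built-in `sqlite3` module. This is only a sketch: SQLite stands in for MySQL, the table holds the five rows from `df.head()` rather than the full dataset, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

# In-memory database with a cut-down lung_cancer table (three columns only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lung_cancer (Name TEXT, Age INTEGER, Result INTEGER)")
conn.executemany(
    "INSERT INTO lung_cancer VALUES (?, ?, ?)",
    [("John", 35, 1), ("John", 27, 1), ("Camela", 30, 0),
     ("Alex", 28, 0), ("Diego", 68, 1)],
)

# SUM(COUNT(*)) OVER () totals the per-group counts, so each row's
# count divided by that total is its share of all matching rows.
rows = conn.execute(
    """
    SELECT Result,
           COUNT(*) AS total,
           COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
    FROM lung_cancer
    WHERE Age > 18
    GROUP BY Result
    ORDER BY Result
    """
).fetchall()
conn.close()

print(rows)
```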

Select the names and ages of individuals whose names contain the word "John".

In [None]:
SELECT Name, Age FROM `lung_cancer` WHERE Name LIKE '%John%' OR Surname LIKE 'John' GROUP BY Name, Age

Find the count of people who have lung cancer with different 'Smoking Category'.

In [None]:
SELECT Smoking_Category,count(Result) FROM lung_cancer where Result='1' GROUP by Smoking_Category


Find the count of people who have lung cancer with different 'Alcohol Category'.

In [None]:
SELECT Alcohol_Category,count(Result) FROM lung_cancer where result='1' GROUP by Alcohol_Category
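
The two category counts can be cross-checked in pandas before loading the exported CSV into MySQL. The four rows below are hypothetical and only illustrate the shape of the result; on the real data, the frame returned by `export_the_dataset()` would take the place of `demo`.

```python
import pandas as pd

# Hypothetical rows using the column names produced by the Python section.
demo = pd.DataFrame({
    "Smoking_Category": ["Heavy Smokers", "Heavy Smokers", "Light Smokers", "Non-Smokers"],
    "Alcohol_Category": ["Mediocre Drinkers", "Light Drinkers", "Light Drinkers", "Non-Drinkers"],
    "Result": [1, 1, 0, 1],
})

# Keep only positive cases, then count rows per category.
positive = demo[demo["Result"] == 1]
by_smoking = positive.groupby("Smoking_Category").size()
by_alcohol = positive.groupby("Alcohol_Category").size()

print(by_smoking)
print(by_alcohol)
```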