# Background

According to the WHO, breast cancer is the most commonly occurring cancer worldwide. In 2020 alone, there were 2.3 million new breast cancer diagnoses and 685,000 deaths. Regular mammograms (low-dose X-rays of the breast) can help detect breast cancer at an early stage, followed by other methods such as ultrasound, other imaging tests, and biopsies.
Depending on the patient's condition, a mammography test in India costs between Rs. 800 and Rs. 3000. Screening mammography systems are currently expensive to operate because early diagnosis of breast cancer requires highly trained human observers. This problem will probably worsen given an impending shortage of radiologists in various countries. A high rate of false-positive results is another drawback of mammography screening. False positives can lead to unwarranted concern, difficult follow-up care, further imaging tests, and occasionally tissue sampling (often a needle biopsy).

# Goal of the Competition

The objective of this challenge is to build a model that identifies breast cancer in mammograms obtained from routine screening.

## Required Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import missingno as msno
import plotly.express as px
import pydicom as dicom
from pydicom import dcmread


## Loading the Dataset

In [None]:
Train_data = pd.read_csv('/kaggle/input/rsna-breast-cancer-detection/train.csv')
Test_data = pd.read_csv('/kaggle/input/rsna-breast-cancer-detection/test.csv')

In [None]:
Train_data.head()

## Data Description

| Attributes | Details | 
| --- | --- |
| site_id | ID code for the source hospital. | 
| patient_id | ID code for the patient. | 
| image_id | ID code for the image. | 
| laterality | Whether the image is of the left or right breast. | 
| view | The orientation of the image. The default for a screening exam is to capture two views per breast. | 
| age | The patient's age in years. | 
| implant | Whether or not the patient had breast implants. Site 1 only provides breast implant information at the patient level, not at the breast level. | 
| density | A rating for how dense the breast tissue is, with A being the least dense and D being the most dense. Extremely dense tissue can make diagnosis more difficult. | 
| machine_id | An ID code for the imaging device. | 
| cancer | Whether or not the breast was positive for cancer. The target value. Only provided for train. |
| biopsy | Whether or not a follow-up biopsy was performed on the breast. Only provided for train. |
| invasive | If the breast is positive for cancer, whether or not the cancer proved to be invasive. Only provided for train. |
| BIRADS | Breast Imaging Reporting & Data System output. 0 if the breast required follow-up, 1 if the breast was rated as negative for cancer, and 2 if the breast was rated as normal. Only provided for train. |
| prediction_id | The ID for the matching submission row. Multiple images will share the same prediction ID. Test only. |
| difficult_negative_case | True if the case was unusually difficult. Only provided for train. | 
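
The table notes that some columns are provided only for train and that `prediction_id` appears only in test. A minimal sketch of how that split could be checked programmatically is shown here. The helper name `column_differences` and the toy frames are hypothetical and are not part of the competition data; in the notebook the same function could be called on `Train_data` and `Test_data`.

```python
import pandas as pd


def column_differences(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return the columns unique to each frame and the columns they share."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return {
        "train_only": sorted(train_cols - test_cols),
        "test_only": sorted(test_cols - train_cols),
        "shared": sorted(train_cols & test_cols),
    }


# Toy frames with made-up values, used only to illustrate the helper.
toy_train = pd.DataFrame({"patient_id": [1, 2], "age": [50.0, 61.0], "cancer": [0, 1]})
toy_test = pd.DataFrame({"patient_id": [3], "age": [47.0], "prediction_id": ["3_L"]})

diff = column_differences(toy_train, toy_test)
print(diff)
```

Any column listed under `train_only` cannot be used as a model input at inference time, so it is worth confirming this list before feature selection.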

## Data Inspection

In [None]:
print('Train Dataset')
print('Number of ROWs:', Train_data.shape[0], '\nNumber of COLUMNs:', Train_data.shape[1])
print('---------------------------------------------------------------------------')
print('Test Dataset')
print('Number of ROWs:', Test_data.shape[0], '\nNumber of COLUMNs:', Test_data.shape[1])

In [None]:
print('Number of unique patients:',Train_data.patient_id.nunique())
print('Number of unique machine_id:',Train_data.machine_id.nunique())
# print('Average mammogram for each patient:',Train_data.groupby("patient_id")["image_id"].agg("count").reset_index()["image_id"].mean())
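
The commented-out line above computes the average number of mammograms per patient. A small sketch of the same idea on a toy frame follows. The helper name `images_per_patient` and the sample values are assumptions for illustration only; the column names `patient_id` and `image_id` come from the data description.

```python
import pandas as pd


def images_per_patient(df: pd.DataFrame) -> pd.Series:
    """Count how many images each patient contributes."""
    return df.groupby("patient_id")["image_id"].count()


# Toy data with made-up IDs: patient 1 has four images, patient 2 has two.
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "image_id": [10, 11, 12, 13, 20, 21],
})

counts = images_per_patient(toy)
mean_images = counts.mean()
print("Average mammograms per patient:", mean_images)
```

Keeping the per-patient counts as a Series, rather than only the mean, also makes it easy to call `counts.describe()` and spot patients with unusually many or few images.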

In [None]:
Train_data.info()

In [None]:
Train_data.describe().T

In [None]:
msno.bar(Train_data,figsize=(20, 5),fontsize=12)
plt.grid()

In [None]:
dups = Train_data.duplicated()
print('Total number of duplicate rows in dataset = %d' % dups.sum())

Train_data[dups]
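
`DataFrame.duplicated()` with no arguments flags a row only when every column matches an earlier row. As a hedged sketch, the check can also be restricted to key columns. The choice of `patient_id` plus `image_id` as the key is an assumption made here for illustration, and the toy values are made up.

```python
import pandas as pd

# Toy data: rows 0 and 1 share the same key but differ in age.
toy = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "image_id": [10, 10, 20],
    "age": [50.0, 51.0, 60.0],
})

# Full-row check: no row is an exact copy of another.
full_row_dups = int(toy.duplicated().sum())

# Key-based check: the (patient_id, image_id) pair repeats once.
key_dups = int(toy.duplicated(subset=["patient_id", "image_id"]).sum())

print("Full-row duplicates:", full_row_dups)
print("Key duplicates:", key_dups)
```

A key-based check can surface records that refer to the same image but disagree on metadata, which a full-row check would miss.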

In [None]:
plt.figure(figsize = (22,7))
plt.subplot(1,2,1)
sns.countplot(x = Train_data.BIRADS)
plt.xticks(np.arange(3), ['Required follow-up', 'Negative for cancer','Rated as normal'], fontsize=12)
plt.xlabel('Breast Imaging Reporting & Data System Output', fontsize=12)
plt.ylabel('Count', fontsize=12)

plt.subplot(1,2,2)
sns.boxplot(x='cancer', y='age', data=Train_data)
plt.xticks(np.arange(2), ['Negative', 'Positive'], fontsize=12)
plt.xlabel('Breast Cancer', fontsize=12)
plt.ylabel('Age', fontsize=12)
plt.show()

In [None]:
plt.figure(figsize = (24,5))
plt.subplot(1,3,1)
sns.countplot(x = Train_data.cancer)
plt.xticks(np.arange(2), ['Negative', 'Positive'], fontsize=12)
plt.xlabel('Breast Cancer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.subplot(1,3,2)
sns.countplot(x = Train_data.biopsy)
plt.xticks(np.arange(2), ['Not Performed', 'Performed'], fontsize=12)  # tick 0 = biopsy 0, tick 1 = biopsy 1
plt.xlabel('Follow-up Biopsy', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.subplot(1,3,3)
sns.countplot(x = Train_data.invasive)
plt.xticks(np.arange(2), ['No', 'Yes'], fontsize=12)  # tick 0 = invasive 0, tick 1 = invasive 1
plt.xlabel('Invasive Breast Cancer', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.show()

In [None]:
Train_data.groupby(['cancer','implant','invasive'])['cancer'].count()
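
Raw group counts like the ones above can be hard to compare when group sizes differ. One possible complement, sketched here on made-up toy data, is the share of positive cases per group: because `cancer` is a 0/1 column, its mean within a group equals the positive rate. The variable names are hypothetical.

```python
import pandas as pd

# Toy data with made-up labels: four breasts without implants, two with.
toy = pd.DataFrame({
    "implant": [0, 0, 0, 0, 1, 1],
    "cancer": [0, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column within each group gives the positive rate.
positive_rate = toy.groupby("implant")["cancer"].mean()
print(positive_rate)
```

On the real data the same pattern would be `Train_data.groupby('implant')['cancer'].mean()`. No result is quoted here because it depends on the actual dataset.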

Next, we will explore the image data.
