# Python Lab Session (TP), Master IBM/RPM
## Part 1: Introduction to the Python language
February 2025

### Albertine Dubois - albertine.dubois@cea.fr and Marion Savanier - marion.savanier@universite-paris-saclay.fr

Learn how to manipulate CSV files with the [pandas](https://pandas.pydata.org/docs/index.html) library.

# Breast cancer detection data set

## Why Pandas?

* Pandas is one of the most powerful libraries for anything related to data science
* Provides fast, flexible, and intuitive manipulation of data frames
* Lets you work with large data sets and/or multiple files at once

## About Pandas
* `pandas` runs on top of `numpy`
* Provides a high-level data structure, the data frame, which looks like a spreadsheet
* Contains built-in functions to clean, group, merge, and concatenate tabular data
* Easy to apply `numpy` and `scipy` functions on tabular data
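
As a minimal sketch of the last two points, the cell below builds a small data frame by hand and applies a `numpy` function to it. The table `toy` and its values are made up for illustration and are not taken from the breast cancer data set.

```python
import numpy as np
import pandas as pd

# toy table with made-up values (hypothetical, for illustration only)
toy = pd.DataFrame({
    "radius": [10.0, 20.0, 30.0],
    "area": [100.0, 400.0, 900.0],
})

# numpy functions are applied element-wise and return a DataFrame again
toy_sqrt = np.sqrt(toy)

# the underlying values can be retrieved as a plain numpy array
toy_values = toy.to_numpy()

print(toy_sqrt)
print(type(toy_values), toy_values.shape)
```

Row and column labels are preserved by `np.sqrt`, which is what makes it convenient to mix `numpy` functions with tabular data.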

## Data description
source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Make sure you have the CSV file in the right directory and start experimenting. Run the cells as they are first, then start making your own changes.

The data comprise features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. There is no need to fully understand where the variables come from or their units; the important thing here is to explore what you can do with Pandas and to see how easily these tools apply to any kind of data set. Hopefully this will motivate you to take one of your own data sets and start experimenting.

In [None]:
!git clone https://github.com/marsvn/PythonM2-jour1.git

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# load the data setting the path_to_file/filename and the type of separator (sep)
# attribute the output of the function (our table) to a variable (breast_cancer_data)
breast_cancer_data = pd.read_csv('/content/PythonM2-jour1/data/breast_cancer_diagnostic_data.csv', sep = ',')
breast_cancer_data

In [None]:
# to skip the first 100 lines of the file (note: this includes the header line)
pd.read_csv('/content/PythonM2-jour1/data/breast_cancer_diagnostic_data.csv', sep = ',', skiprows = 100)

# if the dataset has no header
pd.read_csv('/content/PythonM2-jour1/data/breast_cancer_diagnostic_data.csv', header = None)

# if we want to change the index column
pd.read_csv('/content/PythonM2-jour1/data/breast_cancer_diagnostic_data.csv', sep = ',', index_col = 'radius_mean')
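
The effect of `skiprows` on the header can be checked on a tiny in-memory file. This is only a sketch: the CSV text below is invented for the example, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# invented CSV content: one header line followed by four data lines
csv_text = "id,diagnosis,radius_mean\n1,M,17.9\n2,B,13.5\n3,M,20.1\n4,B,11.4\n"

# an integer skips that many lines from the top of the file, header included,
# so the first remaining line ("2,B,13.5") is used as the header
shifted = pd.read_csv(io.StringIO(csv_text), sep=",", skiprows=2)

# a range starting at 1 skips data lines only and keeps the header (line 0)
kept = pd.read_csv(io.StringIO(csv_text), sep=",", skiprows=range(1, 3))

print(list(shifted.columns))
print(list(kept.columns))
```

When the column names matter, passing a range (or a list of line numbers) that starts at 1 is usually the safer choice.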

In [None]:
# checking the data header (output: an Index object with all column names)
breast_cancer_data.columns

In [None]:
# checking the first n rows of the table (output: a table with the first n rows)
breast_cancer_data.head(5)

In [None]:
# checking the last n rows of the table (output: a table with the last n rows)
breast_cancer_data.tail(5)

In [None]:
# check the dimensions of our dataset (output: number of rows, number of columns)
breast_cancer_data.shape

In [None]:
# check a given column (output: a Series containing the column values for all rows)
breast_cancer_data['area_mean']

In [None]:
# create a subset of the data (output: a subtable containing selected columns)
breast_cancer_data[['area_mean', 'perimeter_mean', 'texture_mean']]

In [None]:
# check the information of a given patient (output: the row at position 100, counting from 0)
breast_cancer_data.iloc[100]

In [None]:
# get a slice of the table (output: a subtable containing selected rows)
breast_cancer_data.iloc[9:90]
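
`.iloc` selects by position and excludes the end of a slice, whereas `.loc` selects by index label and includes it. The sketch below illustrates the difference on a made-up table whose index labels (10 to 13) deliberately differ from the row positions (0 to 3).

```python
import pandas as pd

# toy table with made-up values; index labels are 10..13, positions are 0..3
toy = pd.DataFrame({"area": [500.0, 600.0, 700.0, 800.0]}, index=[10, 11, 12, 13])

# position-based slice: positions 0 and 1, end position excluded
by_position = toy.iloc[0:2]

# label-based slice: labels 10, 11 and 12, end label included
by_label = toy.loc[10:12]

print(by_position)
print(by_label)
```

With the default index (0, 1, 2, ...) labels and positions coincide, so the only visible difference between the two is whether the end of the slice is included.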

In [None]:
# drop columns from the table (output: table without selected column)
# axis = 1 tells the function to apply the operation to the column
breast_cancer_data.drop('compactness_mean', axis = 1)

In [None]:
# drop rows from the table by index (output: table without row 5)
# axis = 0 tells the function to apply the operation to the rows
breast_cancer_data.drop(5, axis = 0)

In [None]:
# drop several rows (5 to 99; the end of the range is excluded) from the table by index
# axis = 0 tells the function to apply the operation to the rows
breast_cancer_data.drop(range(5,100), axis = 0)
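
Note that `drop` returns a new table and leaves the original untouched unless the result is assigned to a variable (or `inplace = True` is passed). A minimal sketch on a made-up table:

```python
import pandas as pd

# toy table with made-up values (hypothetical, for illustration only)
toy = pd.DataFrame({"id": [1, 2, 3], "area": [500.0, 600.0, 700.0]})

# drop returns a new table; toy itself still has both columns
without_area = toy.drop("area", axis=1)

# same for rows: labels 0 and 1 are removed in the result only
last_row_only = toy.drop(range(0, 2), axis=0)

print(list(toy.columns))
print(list(without_area.columns))
print(list(last_row_only.index))
```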

In [None]:
# select slices of lines and columns simultaneously by their index
# left side of the comma concerns row indexes (slice rows 50 to 99)
# right side of the comma concerns column indexes (slice columns 5 to 9)
breast_cancer_data.iloc[50:100, 5:10]

In [None]:
# for a table with only the first 100 rows and the first 3 columns
# we can omit the zero (first element)
breast_cancer_data.iloc[:100, :3]

In [None]:
# by extension, to select the first element of the first column
breast_cancer_data.iloc[0,0]

In [None]:
# to drop columns containing missing values (NaN)
breast_cancer_data.dropna(axis = 1)

In [None]:
# to drop rows containing missing values (NaN)
# (there are no rows containing missing values here)
breast_cancer_data.dropna(axis = 0)
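
To see `dropna` actually remove something, here is a sketch on a made-up table with one missing value. `isna().sum()` is a convenient way to count missing values per column before deciding what to drop.

```python
import numpy as np
import pandas as pd

# toy table with made-up values and a single missing entry
toy = pd.DataFrame({
    "radius": [10.0, np.nan, 30.0],
    "area": [100.0, 400.0, 900.0],
})

# number of missing values in each column
n_missing = toy.isna().sum()

# axis = 0 removes the rows that contain a NaN
rows_kept = toy.dropna(axis=0)

# axis = 1 removes the columns that contain a NaN
cols_kept = toy.dropna(axis=1)

print(n_missing)
print(rows_kept)
print(cols_kept)
```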

In [None]:
# create a subtable containing all the rows and the first 8 columns
# (.copy() makes it an independent table, so the in-place edits below are safe)
breast_cancer_data_subset = breast_cancer_data.iloc[:, :8].copy()

# we discard the column containing the patient id (we don't need it here)
breast_cancer_data_subset.drop('id', axis = 1, inplace = True)

# we rename the columns with better/shorter names
# we do this by passing a new list of names
breast_cancer_data_subset.columns = ['diagnosis', 'radius', 'texture', 'perimeter','area', 'smoothness', 'compactness']

# tables can be saved using .to_csv function
# several parameters can be used to control the shape of the table
breast_cancer_data_subset.to_csv('/content/my_new_table.txt', sep = '\t', index = False)

# and we have our new table
breast_cancer_data_subset

In [None]:
breast_cancer_data_subset.describe()

In [None]:
breast_cancer_data_subset.corr(numeric_only=True)
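
By default `corr` computes the Pearson correlation coefficient between each pair of numeric columns. As a sketch, the cell below recomputes one coefficient by hand on made-up values and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# toy table with made-up values (hypothetical, for illustration only)
toy = pd.DataFrame({
    "radius": [1.0, 2.0, 3.0, 4.0],
    "perimeter": [2.1, 3.9, 6.2, 8.1],
})

# coefficient taken from the pandas correlation matrix
r_pandas = toy.corr(numeric_only=True).loc["radius", "perimeter"]

# Pearson coefficient by hand: covariance divided by the product of the spreads
x = toy["radius"]
y = toy["perimeter"]
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```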

In [None]:
# check the mean of a given column
breast_cancer_data_subset['radius'].mean()

In [None]:
# check the standard deviation of a given column
breast_cancer_data_subset['radius'].std()

In [None]:
# check the maximum or the minimum value of a given column
breast_cancer_data_subset['radius'].max(), breast_cancer_data_subset['radius'].min()

In [None]:
# count number of elements of a given column
breast_cancer_data_subset['radius'].count()

In [None]:
# sum all values in a given column
breast_cancer_data_subset['radius'].sum()

In [None]:
# check the mean value of each column on the table
breast_cancer_data_subset.mean(numeric_only=True)

In [None]:
# these operations can also be applied to slices of data at once
# e.g sum the radius, perimeter and area of the tumor of each patient
breast_cancer_data_subset[['radius','perimeter','area']].sum(axis = 1)

In [None]:
# if we define this as a new parameter (dummy_col)
dummy_col = breast_cancer_data_subset[['radius','perimeter','area']].sum(axis = 1)

# we can add it to our table using the assign function
# sets the column name (string) and the variable that goes with it
breast_cancer_data_subset.assign(new_col = dummy_col)
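
Like `drop`, `assign` returns a new table and does not modify the original. The sketch below, on a made-up table, shows both ways of adding a column: `assign` (new table) and direct assignment with square brackets (in place).

```python
import pandas as pd

# toy table with made-up values (hypothetical, for illustration only)
toy = pd.DataFrame({"radius": [1.0, 2.0], "area": [3.0, 4.0]})
total = toy[["radius", "area"]].sum(axis=1)

# assign returns a new table; toy itself is unchanged at this point
extended = toy.assign(new_col=total)
columns_after_assign = list(toy.columns)

# direct assignment adds the column to toy itself
toy["new_col"] = total

print(columns_after_assign)
print(list(toy.columns))
```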

In [None]:
# to replace values with other values we can use the replace function
# to replace 'M' and 'B' by the full word 'Malign' and 'Benign', respectively:
breast_cancer_data_subset['diagnosis'] = breast_cancer_data_subset['diagnosis'].replace(to_replace = ['M','B'], value = ['Malign','Benign'])
breast_cancer_data_subset

In [None]:
# to answer the question "how many patients had a positive test result?"
# we use the value_counts() function: counts unique values in categorical variables
breast_cancer_data_subset['diagnosis'].value_counts()
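
`value_counts` can also return proportions instead of raw counts through its `normalize` parameter. A minimal sketch on a made-up list of diagnoses:

```python
import pandas as pd

# made-up diagnoses (hypothetical, for illustration only)
diag = pd.Series(["M", "B", "B", "M", "B"])

# raw counts per category
counts = diag.value_counts()

# fraction of the total per category
fractions = diag.value_counts(normalize=True)

print(counts)
print(fractions)
```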

In [None]:
# to sort our table according to a specific column we use sort_values()
breast_cancer_data_subset.sort_values(by = 'area', ascending = False)

In [None]:
# to build a scatter plot with two variables
breast_cancer_data_subset.plot(kind = 'scatter', x = 'radius', y = 'texture');

In [None]:
# by default pandas plots already look quite nice
# exploring the parameters we can create something even more fancy
breast_cancer_data_subset.plot(kind = 'scatter', x = 'radius', y = 'texture',  figsize = (6,5), alpha = 0.6,
                               color = 'blue', s = 80, edgecolor = 'w', label = 'All patients', grid = False)

In [None]:
# following the same rationale we can create a histogram for a given parameter
breast_cancer_data_subset.plot(kind = 'hist', y = 'radius', bins = 10,
                              color = 'lightblue', edgecolor = 'black', width = 1.5);

In [None]:
# we can also just select the column we want to plot
breast_cancer_data_subset['radius'].plot(kind = 'hist', edgecolor = 'black', width = 1.5)

In [None]:
# and even plot multiple histograms at once
breast_cancer_data_subset[['radius','texture']].plot(kind = 'hist', bins = 15, alpha = 0.8, width = 1.8,
                                                     color = ['lightblue','orange'], edgecolor = 'black');

In [None]:
# we can also combine plot with the previous functions
# for instance, value_counts() with a bar plot to visualize the proportions of benign and malignant tumors
breast_cancer_data_subset['diagnosis'].value_counts(normalize = True).plot(kind = 'bar', color = 'lightblue',
                                                                fontsize = 12, width = 0.5, edgecolor = 'blue', rot = 0)

In [None]:
# seaborn is imported as sns by convention
import seaborn as sns

In [None]:
# the correlation matrix that we created before
corr_table = breast_cancer_data_subset.corr(numeric_only=True)

# can be transformed into a heatmap with seaborn's heatmap function
sns.heatmap(corr_table)

In [None]:
# as for pandas, you can shape the design with a few parameters
sns.heatmap(breast_cancer_data_subset.corr(numeric_only=True), cmap = 'Blues', linewidth = 0.1)

In [None]:
# we can also plot every pairwise combination of two parameters
# using the pairplot function directly on our table
sns.pairplot(breast_cancer_data_subset)

In [None]:
# once again, only numerical data is shown
# but columns containing categorical data (diagnosis in our data)
# can be used as a parameter to group data into classes
sns.pairplot(breast_cancer_data_subset, hue = 'diagnosis')

In [None]:
# to create a simple boxplot with seaborn
# by default they already look super good
sns.boxplot(y = 'area', x = 'diagnosis', data = breast_cancer_data_subset,
            hue='diagnosis',palette = ['tomato','lightgreen'], width = 0.5)

# we can also add a stripplot to make it more informative
sns.stripplot(y = 'area', x = 'diagnosis', data = breast_cancer_data_subset,
                 hue='diagnosis',palette = ['tomato','lightgreen'])

In [None]:
# and explore the function parameters to make it more appealing
sns.boxplot(y = 'area', x = 'diagnosis', data = breast_cancer_data_subset,
            hue='diagnosis',palette = ['tomato','lightgreen'], width = 0.5)

sns.stripplot(y = 'area', x = 'diagnosis', data = breast_cancer_data_subset,
              color = 'black', alpha = 0.6, size = 4, edgecolor = 'w', linewidth = 0.5)
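
The box plot compares the distribution of `area` between the two diagnosis groups. A numeric counterpart can be sketched with `groupby`, one of the grouping functions mentioned in the introduction. The table below uses made-up values, not the real measurements.

```python
import pandas as pd

# toy table with made-up values (hypothetical, for illustration only)
toy = pd.DataFrame({
    "diagnosis": ["Malign", "Benign", "Malign", "Benign"],
    "area": [1000.0, 400.0, 1200.0, 500.0],
})

# one row per diagnosis, one column per summary statistic
summary = toy.groupby("diagnosis")["area"].agg(["median", "min", "max"])

print(summary)
```

Applied to `breast_cancer_data_subset`, the same call would give the median and range that the boxes and whiskers represent graphically.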