# Introduction

This study examined banding patterns on snail shells. The authors investigated the placement of, and interactions between, the bands on the shells of 440 individual snails (271 Cepaea nemoralis and 169 Cepaea hortensis) from 40 populations distributed throughout the UK and mainland Europe. The data set I chose from this paper covers 86 of those 440 snails, for which the authors recorded the colour, banding, and shape of the shells.

# Project Part 3
# 3.1
1. Obtain the data file from the following link to the paper's supplementary information (Tables S1-S3): https://pmc.ncbi.nlm.nih.gov/articles/instance/8207382/bin/ECE3-11-6634-s002.xlsx

2. Open the downloaded file in Excel and save Supplementary Table 3 (the third tab in the Excel file) as a CSV named "snail_shells.csv" in the same directory as the notebook.

In [None]:
# pandas must be installed; it reads the saved CSV file into the notebook.
import pandas as pd

# Clean the data while reading it in: keep only the named columns and the 86 data rows,
# which drops the rows and columns that held NaN or empty cells.
snail_shells = pd.read_csv(
    "snail_shells.csv",
    usecols=['population', 'id', 'location', 'age', 'colour', 'banding',
             'width (mm)', 'height (mm)', 'weight (g)', 'shape (height/width)'],
    skiprows=2,
    nrows=86,
)

# Store the cleaned data in a data frame called "snail" and name its index "snail".
snail = pd.DataFrame(snail_shells)
snail.index.name = "snail"

# Look at the first 5 rows of the data frame.
snail.head()
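
The effect of `skiprows`, `usecols`, and `nrows` can be checked on a small in-memory table. The sketch below is only an illustration: the CSV text, column names, and values are made up and merely mimic a sheet with title rows above the header, an empty column, and a footnote below the data. It is not the real supplementary table.

```python
import io

import pandas as pd

# Hypothetical CSV text (invented for illustration): two title rows,
# a header row with one empty column, three data rows, then an empty
# row and a footnote.
raw = (
    "Supplementary Table (made-up example),,,\n"
    "Measured values,,,\n"
    "id,location,,width (mm)\n"
    "a1,Jueu,,20.1\n"
    "a2,Vielha,,19.4\n"
    "a3,Jueu,,21.0\n"
    ",,,\n"
    "Footnote: values are invented,,,\n"
)

demo = pd.read_csv(
    io.StringIO(raw),
    skiprows=2,                                # skip the two title rows
    usecols=["id", "location", "width (mm)"],  # drop the empty column
    nrows=3,                                   # stop before the empty row and footnote
)
print(demo)
```

Because the unwanted rows and columns are never read, the resulting data frame has no missing values to remove afterwards.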

# 3.2

In [None]:
snail.shape

This data frame contains 86 rows, one for each of the 86 snails selected for this data set from the 440 snails observed in the study, and 10 columns, one for each of the 10 shell features.
1. The two features I will investigate from this data set are the shape (height/width) of the snail shells and the location of the snails.
   
2. For the shape (height/width) of the snails, the "describe" function would return a numeric summary: a count of 86 (the number of snail shells observed), a mean of around 0.78, a median of around 0.77, a minimum of around 0.7, a maximum of about 1, and the standard deviation.

3. For the location feature, the "describe" function would behave as it did for the genotype markers in the Arabidopsis data: it would not return a mean, min, or max like the shape (height/width) feature. Instead it would give the count, the number of unique locations, the top location, and how often the top location occurs in the data. The location feature would have 86 observations split between two locations, with Jueu occurring more often and accounting for more than half of the observations.

In [None]:
print(snail['shape (height/width)'].describe())
snail['location'].describe()

1. Both features have 86 observations, as predicted above for the chosen location and shape (height/width) features.

2. For the shape (height/width) feature, the "describe" function returned a numeric summary with a mean of 0.771640. I had predicted around 0.78 from looking at the 86 values, which range from 0.702 to 1.02 with a median of 0.765.

3. For the location feature, as predicted, the "describe" function did not return a mean, median, minimum, or maximum. Instead it reported the unique values, corresponding to the two locations, and the most frequent location, Jueu, where 53 of the 86 snails were found.
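
The two outputs differ because "describe" picks its summary from the column type. The toy example below, with made-up values rather than the snail data, shows which statistics are returned for a numeric column and for a text column.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the snail data).
toy = pd.DataFrame({
    "shape": [0.75, 0.77, 0.80, 0.78],
    "location": ["Jueu", "Jueu", "Vielha", "Jueu"],
})

numeric_summary = toy["shape"].describe()         # count, mean, std, min, quartiles, max
categorical_summary = toy["location"].describe()  # count, unique, top, freq

print(numeric_summary.index.tolist())
print(categorical_summary.index.tolist())
```

For the real data, the reported frequency of 53 out of 86 means Jueu accounts for 53/86, or about 62%, of the observations.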

# 3.3

3.3.1
- For the shape (height/width) feature I would make a histogram using the displot function in the seaborn package, with the distribution of shell shape split by the banding pattern feature in the data set. A histogram shows how many of the 86 snails fall into each range of shell shape. Splitting the distribution by banding pattern makes it possible to see whether shell shape influences the type of banding pattern observed.

In [None]:
# The seaborn package must be installed to make a distribution plot.
import seaborn as sns

# Histogram of the shape (height/width) feature from the "snail" data frame:
# shape is on the x-axis and the banding pattern subcategorizes the snail shell shapes.
sns.displot(
    data=snail,
    x='shape (height/width)',
    hue='banding')

- The histogram above shows the distribution of shell shape, from the minimum of 0.70 to the maximum of 1.02, and which banding patterns occur at each shape. The median falls in the bin between 0.76 and 0.78, consistent with the 50% value of 0.765 from the describe function. Both the 10345 and 12345 banding patterns are most frequently observed at shapes between 0.765 and 0.81.
   
- The shape of the snail shells relates to Tables 1 and 2 in the paper, which use statistical testing to examine the impact of shell shape, height, or weight in relation to the width of the bands in the banding patterns.
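
The visual impression from a histogram split by banding pattern can be cross-checked with a per-group numeric summary. The sketch below uses a small made-up table; only the column names and the two banding labels follow the snail data, and the values are invented.

```python
import pandas as pd

# Invented shape values for two of the banding labels mentioned above.
toy = pd.DataFrame({
    "banding": ["12345", "12345", "12345", "10345", "10345"],
    "shape (height/width)": [0.76, 0.78, 0.80, 0.77, 0.81],
})

# One row per banding pattern: how many shells, and where their shapes lie.
per_banding = (
    toy.groupby("banding")["shape (height/width)"]
    .agg(["count", "median", "min", "max"])
)
print(per_banding)
```

Applied to the `snail` data frame, the same `groupby` call would give the count and shape range for each banding pattern in the histogram.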



3.3.2
- For the location feature I would make a strip plot using the catplot function in seaborn and compare location against the shape and banding pattern features.

- A categorical plot such as a strip plot shows a numeric feature (shape) against a categorical feature (location), giving a comparison similar to that of a histogram. Plotting shape by location and subcategorizing by banding pattern gives insight into which location may have a higher prevalence of a given banding pattern and whether shape differs between locations depending on the banding pattern.

In [None]:
# catplot is seaborn's categorical plot function; a strip plot is its default kind,
# and jitter is used to distinguish between the data points.
# Location is plotted against shape (height/width), with banding pattern subcategorizing the snail locations.
sns.catplot(data=snail, x='location', y='shape (height/width)', hue='banding')

- The strip plot above shows the two locations, Vielha and Jueu, which the 'describe' function reported as the unique values, with Jueu as the top (most frequent) location. Jueu has observations for all three banding patterns on the snail shells, but most of the data in this set are for the 12345 banding pattern.
  
- The location of the snails relates to Figures 1 and 5 in the paper. Figure 1 shows the banding pattern phenotypes observed in the study, while Figure 5 compares the two species for the 12345 phenotype only, looking at the differences between each band.
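
The number of snails behind each group of points in a strip plot can be tabulated with a cross-tabulation of location against banding pattern. The sketch below uses a made-up table; the location names and banding labels follow the snail data, but the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "location": ["Jueu", "Jueu", "Jueu", "Vielha", "Vielha"],
    "banding": ["12345", "12345", "10345", "12345", "12345"],
})

# Rows are locations, columns are banding patterns, cells are snail counts.
counts = pd.crosstab(toy["location"], toy["banding"])
print(counts)
```

Running `pd.crosstab(snail['location'], snail['banding'])` on the real data frame would show how many snails of each banding pattern were recorded at each location.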