<h2>Initial steps for setup</h2>

In [1]:
#importing pandas
import pandas as pd

In [2]:
#reading the csv into pandas dataframe
textile_data=pd.read_csv('Plastic based Textiles in clothing industry.csv')

In [3]:
#first 5 rows
textile_data.head()

Unnamed: 0,Company,Product_Type,Production_Year,Greenhouse_Gas_Emissions,Pollutants_Emitted,Water_Consumption,Energy_Consumption,Waste_Generation,Sales_Revenue
0,Zara,Polyester,2020,5000,20,7500,1200,300,500000
1,Zara,Nylon,2019,3000,15,5000,900,200,450000
2,Zara,Recycled_Poly,2021,3500,18,6000,1100,250,480000
3,Zara,Cotton,2018,2000,10,4500,800,180,550000
4,Zara,Synthetic_Blend,2022,6000,25,8000,1500,350,600000


In [4]:
#number of rows in the dataset
len(textile_data)

6956

<h2> Removing Nulls & Duplicates </h2>

In [5]:
#checking to find any null values
textile_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6956 entries, 0 to 6955
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Company                   6956 non-null   object
 1   Product_Type              6956 non-null   object
 2   Production_Year           6956 non-null   int64 
 3   Greenhouse_Gas_Emissions  6956 non-null   int64 
 4   Pollutants_Emitted        6956 non-null   int64 
 5   Water_Consumption         6956 non-null   int64 
 6   Energy_Consumption        6956 non-null   int64 
 7   Waste_Generation          6956 non-null   int64 
 8   Sales_Revenue             6956 non-null   int64 
dtypes: int64(7), object(2)
memory usage: 489.2+ KB


Every column reports 6956 non-null values, which matches the number of rows, so none of the columns contains null values.

In [6]:
#confirmation of lack of null values
textile_data.isnull().sum()

Company                     0
Product_Type                0
Production_Year             0
Greenhouse_Gas_Emissions    0
Pollutants_Emitted          0
Water_Consumption           0
Energy_Consumption          0
Waste_Generation            0
Sales_Revenue               0
dtype: int64

The findings are consistent: there are no null values.
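
The per-column counts can also be collapsed into a single number or boolean, which is convenient as a guard before further processing. The following is a minimal sketch on a made-up frame (`toy` is hypothetical and deliberately contains one null, unlike the textile dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for textile_data (one missing value on purpose)
toy = pd.DataFrame({
    "Company": ["Zara", "Nike", None],
    "Sales_Revenue": [500000, 450000, 480000],
})

# Total number of missing cells across all columns
missing_cells = int(toy.isnull().sum().sum())

# True only if no column contains a null
has_no_nulls = not toy.isnull().any().any()

print(missing_cells, has_no_nulls)
```

On the real dataset the same two expressions would be expected to give 0 and True, given the output of In [6].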

In [7]:
#dropping duplicate rows
textile_data=textile_data.drop_duplicates()

In [8]:
len(textile_data)

6956

No duplicate rows were found, since the number of rows in the dataset is unchanged.
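
Comparing lengths before and after works, but duplicates can also be counted directly. The sketch below uses a small made-up frame (`toy` is hypothetical) that contains one repeated row:

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first
toy = pd.DataFrame({
    "Company": ["Zara", "Zara", "Nike"],
    "Product_Type": ["Nylon", "Nylon", "Cotton"],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()

print(n_duplicates, len(deduped))
```

A count of 0 from `duplicated().sum()` would confirm the conclusion above without needing to compare row counts.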

<h2> Exploring Data </h2><br>
This section looks at how the data is distributed.

In [9]:
#checking the range of numeric data
textile_data.describe()

Unnamed: 0,Production_Year,Greenhouse_Gas_Emissions,Pollutants_Emitted,Water_Consumption,Energy_Consumption,Waste_Generation,Sales_Revenue
count,6956.0,6956.0,6956.0,6956.0,6956.0,6956.0,6956.0
mean,2020.00115,3891.2523,16.992812,5983.972542,1106.555204,248.891748,510234.839275
std,1.415026,1219.957903,4.913978,1153.555849,232.108015,57.75123,51775.324086
min,2018.0,1800.0,9.0,4000.0,700.0,150.0,420000.0
25%,2019.0,2837.75,13.0,5005.0,906.0,198.0,465467.75
50%,2020.0,3876.5,17.0,5943.0,1112.0,249.0,510494.0
75%,2021.0,4968.0,21.0,6991.25,1308.0,299.0,555122.75
max,2022.0,6000.0,25.0,8000.0,1500.0,350.0,600000.0


In [10]:
company_counts=pd.DataFrame(textile_data['Company'].value_counts())
company_counts

Company,count
Nike,1444
Zara,1396
Urban Outfitters,1390
Adidas,1376
Forever 21,1350


In [11]:
#checking the distribution of rows amongst the companies
company_counts['Percentage_of_share']=company_counts['count']/len(textile_data)
company_counts

Company,count,Percentage_of_share
Nike,1444,0.207591
Zara,1396,0.20069
Urban Outfitters,1390,0.199827
Adidas,1376,0.197815
Forever 21,1350,0.194077


Each company has a similar share of the dataset (roughly 20% each), so no company is disproportionately represented.
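
Note that the `Percentage_of_share` column holds fractions (for example 0.2076), not percentages. As a minimal sketch on made-up data (`toy` is hypothetical), `value_counts(normalize=True)` produces the same fractions in one step, and multiplying by 100 gives actual percentages:

```python
import pandas as pd

# Hypothetical toy frame standing in for textile_data
toy = pd.DataFrame({"Company": ["Nike", "Nike", "Zara", "Adidas"]})

# normalize=True returns each company's fraction of all rows (values sum to 1)
share = toy["Company"].value_counts(normalize=True)

# Multiply by 100 for a true percentage
share_pct = (share * 100).round(1)

print(share_pct)
```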

<h2> Preprocessing for analysis and visualization</h2> <br>


In [12]:
#making the column names visualization friendly by replacing underscores with spaces
textile_data.columns = [colname.replace('_', ' ') for colname in textile_data.columns]
textile_data.head()

Unnamed: 0,Company,Product Type,Production Year,Greenhouse Gas Emissions,Pollutants Emitted,Water Consumption,Energy Consumption,Waste Generation,Sales Revenue
0,Zara,Polyester,2020,5000,20,7500,1200,300,500000
1,Zara,Nylon,2019,3000,15,5000,900,200,450000
2,Zara,Recycled_Poly,2021,3500,18,6000,1100,250,480000
3,Zara,Cotton,2018,2000,10,4500,800,180,550000
4,Zara,Synthetic_Blend,2022,6000,25,8000,1500,350,600000


In [13]:
#Product Type values also contain underscores; list the values to check for any other special characters
textile_data['Product Type'].value_counts()

Product Type
Linen              686
Organic_Cotton     673
Polyester          665
Microfiber         653
Nylon              637
Recycled_Poly      630
Viscose            623
Cotton             613
Wool               611
Tencel             583
Synthetic_Blend    582
Name: count, dtype: int64
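
Reading the list by eye is enough for eleven values. For a longer list, the check could be automated with a regular expression. The sketch below runs on made-up values (`toy` and the entry `Poly-blend!` are hypothetical and do not appear in the dataset):

```python
import pandas as pd

# Hypothetical values; "Poly-blend!" is invented to trigger the check
toy = pd.Series(["Linen", "Organic_Cotton", "Recycled_Poly", "Poly-blend!"])

# Match any character that is neither a letter nor an underscore
has_other_chars = toy.str.contains(r"[^A-Za-z_]", regex=True)

# Values that would need cleaning beyond the underscore replacement
unexpected = toy[has_other_chars].tolist()

print(unexpected)
```

An empty list would mean that underscores are the only special character to handle.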

In [14]:
#replace the underscore by a space
textile_data['Product Type']=textile_data['Product Type'].str.replace('_',' ',regex=False)
textile_data.head()

Unnamed: 0,Company,Product Type,Production Year,Greenhouse Gas Emissions,Pollutants Emitted,Water Consumption,Energy Consumption,Waste Generation,Sales Revenue
0,Zara,Polyester,2020,5000,20,7500,1200,300,500000
1,Zara,Nylon,2019,3000,15,5000,900,200,450000
2,Zara,Recycled Poly,2021,3500,18,6000,1100,250,480000
3,Zara,Cotton,2018,2000,10,4500,800,180,550000
4,Zara,Synthetic Blend,2022,6000,25,8000,1500,350,600000


In [15]:
#export cleaned data to csv
textile_data.to_csv("Cleaned Dataset.csv")
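
By default `to_csv` also writes the row index as an extra first column with no header, which shows up as `Unnamed: 0` when the file is read back. Whether that matters depends on how the file is used downstream. The sketch below illustrates the difference on a made-up frame (`toy` is hypothetical), writing to in-memory buffers rather than to disk:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"Company": ["Zara", "Nike"], "Sales Revenue": [500000, 450000]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```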

<h2>Modifying Data Format for Visualization</h2>

<h3> For visualizations 1 & 2: numeric values over the years. Creating a dataset containing the numeric parameters per year for each brand, excluding product type</h3>

In [16]:
#total of each numeric column per production year and company
company_year_data=textile_data.groupby(['Production Year','Company']).sum(numeric_only=True)
company_year_data

Production Year,Company,Greenhouse Gas Emissions,Pollutants Emitted,Water Consumption,Energy Consumption,Waste Generation,Sales Revenue
2018,Adidas,1044326,4495,1598015,291479,68006,133652623
2018,Forever 21,1138428,4812,1683287,310153,72189,144650668
2018,Nike,1168119,5036,1724813,322613,71903,149306143
2018,Urban Outfitters,1027296,4458,1596444,288829,64648,130937736
2018,Zara,1108090,4955,1708959,318237,71528,145638397
2019,Adidas,1039273,4608,1614774,299960,67726,138442743
2019,Forever 21,1146260,5028,1745852,330501,73609,151774036
2019,Nike,1137905,5098,1749653,331199,71891,149541222
2019,Urban Outfitters,1091130,4713,1678043,311365,69264,143944554
2019,Zara,1013600,4499,1585130,291225,64243,134701549


In [17]:
company_year_data.to_csv('Flourish Viz 1 and 2.csv')
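
To make the aggregation concrete, here is a minimal sketch of the same group-and-sum pattern on a made-up frame (`toy` and its values are hypothetical). It also shows `reset_index()`, which turns the two grouping levels back into ordinary columns if a flat table is preferred:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of columns as the cleaned data
toy = pd.DataFrame({
    "Production Year": [2018, 2018, 2018, 2019],
    "Company": ["Zara", "Zara", "Nike", "Zara"],
    "Product Type": ["Nylon", "Cotton", "Nylon", "Nylon"],
    "Sales Revenue": [100, 200, 300, 400],
})

# One row per (year, company); numeric_only=True leaves out the text column
totals = toy.groupby(["Production Year", "Company"]).sum(numeric_only=True)

# reset_index() converts the two index levels into regular columns
flat = totals.reset_index()

print(flat)
```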

<h3> For visualization 3: Relation between Product Type and Greenhouse Gas Emissions</h3>

In [18]:
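#totals per company and product type; summed years and revenue are not needed here, so they are dropped
product_type_data=textile_data.groupby(['Company','Product Type']).sum(numeric_only=True).drop(['Production Year','Sales Revenue'],axis=1)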
product_type_data

Company,Product Type,Greenhouse Gas Emissions,Pollutants Emitted,Water Consumption,Energy Consumption,Waste Generation
Adidas,Cotton,439255,1899,631087,120999,28416
Adidas,Linen,555228,2534,867638,167505,37167
Adidas,Microfiber,452320,1982,680268,122534,28828
Adidas,Nylon,467970,2021,670818,132740,30073
Adidas,Organic Cotton,551349,2309,812275,148933,35591
Adidas,Polyester,584509,2444,869793,159773,36125
Adidas,Recycled Poly,480943,2031,757806,139722,31465
Adidas,Synthetic Blend,461289,1799,668253,128688,27507
Adidas,Tencel,426001,1878,671977,116510,26025
Adidas,Viscose,509177,2244,806401,149888,34090


In [19]:
product_type_data.to_csv('Flourish Viz 3.csv')
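
One caveat when comparing product types by summed emissions: the number of rows per product type differs (from 582 to 686 in In [13]), so a total partly reflects how many records a type has. As a possible alternative, not what the notebook exports, the mean per record could be compared instead. A minimal sketch on made-up numbers (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: Linen has more records, Wool has higher emissions per record
toy = pd.DataFrame({
    "Product Type": ["Linen", "Linen", "Linen", "Wool"],
    "Greenhouse Gas Emissions": [2000, 2000, 2000, 5000],
})

by_type = toy.groupby("Product Type")["Greenhouse Gas Emissions"]

# Totals favor the type with more rows; means compare emissions per record
total = by_type.sum()
average = by_type.mean()

print(total, average, sep="\n")
```

In this toy case the total ranks Linen highest while the mean ranks Wool highest, which is the kind of discrepancy to watch for in the chart.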