# Classifying book authors' gender by name (approximately)

Dataset used: https://www.kaggle.com/datasets/gracehephzibahm/gender-by-name/data

In our research project on the representation of female book authors on WikiData, the first step is to infer each author's gender from their name. The goal is to sort authors into three groups:
- male authors,
- female authors,
- authors whose gender is inconclusive, including books attributed to collective works.

By doing so, we aim to gain insights into the gender distribution within the dataset, allowing us to later explore the coverage of these authors on WikiData.

However, it's crucial to acknowledge that gender is a complex and subjective aspect of individual identity. In a proper and respectful context, one should always ask a person about their gender identity. In our case, we recognize the limitations of deducing gender solely based on names and emphasize that our approach serves as a coarse approximation. We understand that gender identity is deeply personal and can't be accurately determined through this method. Our intent is to use name matching as a preliminary step to explore broad patterns in our dataset and initiate a meaningful analysis of gender representation. We approach this with sensitivity and respect for the diverse identities of individuals.

In [1]:
%cd ..

C:\Users\jplak\projects\women_books



## 0. Import Data

In [68]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

In [69]:
df_names = pd.read_csv("./data/gender_by_name_dataset.csv")

In [70]:
df_books = pd.read_parquet("./datasets/2024-01-19_books.parquet")

In [137]:
df_books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,"[Mark, Morford]",[],[Mark],[]
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,"[Richard, Bruce, Wright]",[],"[Richard, Bruce, Wright]",[]
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,"[Carlo, D'Este]",[],[Carlo],[]
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,"[Gina, Bari, Kolata]","[Gina, Bari]",[],[]
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,[Barber],[],[Barber],[]
...,...,...,...,...,...,...,...,...,...,...
271374,440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,"[Paula, Danziger]",[Paula],[],[]
271375,525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,"[Teri, Sloat]",[Teri],[],[]
271376,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,"[Christine, Wicker]",[Christine],[],[]
271377,192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,[Plato],[],[Plato],[]


## 1. Name dataset EDA and preprocessing

In [71]:
df_names

Unnamed: 0,Name,Gender,Count,Probability
0,James,M,5304407,1.451679e-02
1,John,M,5260831,1.439753e-02
2,Robert,M,4970386,1.360266e-02
3,Michael,M,4579950,1.253414e-02
4,William,M,4226608,1.156713e-02
...,...,...,...,...
147264,Zylenn,M,1,2.736740e-09
147265,Zymeon,M,1,2.736740e-09
147266,Zyndel,M,1,2.736740e-09
147267,Zyshan,M,1,2.736740e-09


In [72]:
df_names["Gender"].value_counts()

Gender
F    89749
M    57520
Name: count, dtype: int64

In [73]:
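# Note: df_female_names is defined in a later cell (In [84]); run that cell first.
# Female names that contain characters other than letters and hyphens.
df_female_names[
    (~df_female_names["Name"].str.isalpha())
    & (~df_female_names["Name"].str.contains("-"))
]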

Unnamed: 0,Name,Gender,Count,Probability
27085,Chlo…,F,247,6.759750e-07
31858,Am…Lie,F,180,4.926130e-07
42282,Esm…,F,94,2.572540e-07
55930,Ku.,F,43,1.176800e-07
59234,Ren…E,F,36,9.852270e-08
...,...,...,...,...
133622,Yoshoda@,F,1,2.736740e-09
133673,Za'Quiesha,F,1,2.736740e-09
133782,Ze'Eva,F,1,2.736740e-09
133888,Zo'E,F,1,2.736740e-09


In [74]:
df_female_names[
    df_female_names["Name"].str.endswith(".")
]

Unnamed: 0,Name,Gender,Count,Probability
55930,Ku.,F,43,1.1768e-07
65877,Km.,F,25,6.84185e-08
93539,Smts.,F,6,1.64204e-08
101918,Miss.,F,5,1.36837e-08
101985,Ms.,F,5,1.36837e-08
110755,Mrs.,F,4,1.0947e-08
111446,Kr.,F,3,8.21022e-09
113109,Kum.,F,2,5.47348e-09
117455,B.,F,1,2.73674e-09
123070,K.,F,1,2.73674e-09


In [75]:
df_names[df_names["Name"] == "Mark"]

Unnamed: 0,Name,Gender,Count,Probability
23,Mark,M,1410637,0.003861
4216,Mark,F,4668,1.3e-05


In [81]:
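# Count how many rows (one per gender) each name has in the dataset.
name_and_count: pd.Series = df_names.groupby("Name")["Count"].count()
# A name with a single row is listed under one gender only.
single_sex_names_mask = name_and_count == 1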

In [82]:
single_sex_names = name_and_count[single_sex_names_mask]

In [83]:
df_names_single_sex = df_names[df_names["Name"].isin(single_sex_names.index)].copy()
df_names_mixed_sex = df_names[~df_names["Name"].isin(single_sex_names.index)].copy()

In [84]:
df_female_names = df_names_single_sex[df_names_single_sex["Gender"] == "F"].copy()  
df_male_names = df_names_single_sex[df_names_single_sex["Gender"] == "M"].copy()  

In [85]:
print(f'Number of female names: {len(df_female_names)}')
print(f'Number of male names: {len(df_male_names)}')

Number of female names: 76390
Number of male names: 44161


In [88]:
print(df_female_names['Name'].isin(df_male_names['Name']).any())
print(df_male_names['Name'].isin(df_female_names['Name']).any())

False
False


In [89]:
df_female_names

Unnamed: 0,Name,Gender,Count,Probability
1180,Delilah,F,40776,1.115930e-04
1312,Brielle,F,34420,9.419860e-05
1348,Helene,F,33317,9.118000e-05
1371,Alina,F,32527,8.901800e-05
1397,Lyla,F,31401,8.593640e-05
...,...,...,...,...
133945,Zyika,F,1,2.736740e-09
133946,Zymeliah,F,1,2.736740e-09
133947,Zyrije,F,1,2.736740e-09
133948,Zyrinah,F,1,2.736740e-09


In [38]:
df_male_names

Unnamed: 0,Name,Gender,Count,Probability
1378,Jakob,M,32249,8.825720e-05
1669,Jefferson,M,22397,6.129480e-05
1683,Rodger,M,21922,5.999480e-05
1764,Romeo,M,19941,5.457330e-05
1782,Vicente,M,19616,5.368390e-05
...,...,...,...,...
147264,Zylenn,M,1,2.736740e-09
147265,Zymeon,M,1,2.736740e-09
147266,Zyndel,M,1,2.736740e-09
147267,Zyshan,M,1,2.736740e-09


In [16]:
df_names_mixed_sex

Unnamed: 0,Name,Gender,Count,Probability
0,James,M,5304407,1.451679e-02
1,John,M,5260831,1.439753e-02
2,Robert,M,4970386,1.360266e-02
3,Michael,M,4579950,1.253414e-02
4,William,M,4226608,1.156713e-02
...,...,...,...,...
147222,Ziyuan,M,1,2.736740e-09
147224,Zlata,M,1,2.736740e-09
147243,Zowie,M,1,2.736740e-09
147246,Zozo,M,1,2.736740e-09


In [17]:
df_names_mixed_sex.sort_values('Name').head(30)

Unnamed: 0,Name,Gender,Count,Probability
112246,A,F,2,5.47348e-09
114150,A,M,2,5.47348e-09
96227,Aaden,F,5,1.36837e-08
4119,Aaden,M,4877,1.33471e-05
12909,Aadi,M,857,2.34539e-06
73983,Aadi,F,16,4.37879e-08
73985,Aadyn,F,16,4.37879e-08
17807,Aadyn,M,516,1.41216e-06
29407,Aalijah,M,212,5.80189e-07
34656,Aalijah,F,150,4.10511e-07


In [90]:
print(len(df_names_mixed_sex))

26718


In [91]:
df_names_mixed_sex['Name'].nunique()

13359

In [92]:
name_groups = df_names_mixed_sex.groupby('Name')

In [93]:
name_groups.get_group('Yury')

Unnamed: 0,Name,Gender,Count,Probability
28638,Yury,F,223,6.10293e-07
71547,Yury,M,19,5.19981e-08


In [94]:
mainly_female_name_buffer = set()
mainly_male_name_buffer = set()
inconclusive_name_buffer = set()
for name in tqdm(df_names_mixed_sex['Name'].unique()):
    subframe = name_groups.get_group(name)
    count_male = subframe[subframe['Gender'] == 'M']['Count'].item()
    count_female = subframe[subframe['Gender'] == 'F']['Count'].item()
    # Relative excess of male over female counts for this name.
    share = count_male/count_female - 1
    if share > 0.3:       # male count exceeds 1.3x the female count
        mainly_male_name_buffer.add(name)
    elif share < -0.3:    # male count is below 0.7x the female count
        mainly_female_name_buffer.add(name)
    else:
        inconclusive_name_buffer.add(name)

100%|██████████████████████████████████████████████████████████████████████████| 13359/13359 [00:06<00:00, 2036.99it/s]
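To make the thresholds in the loop above concrete, here is a minimal sketch that applies the same rule to counts taken from the tables earlier in this notebook. The function name `classify_mixed_name` is hypothetical and is not part of the original code; it only restates the `share` rule in isolation.

```python
def classify_mixed_name(count_male: int, count_female: int, threshold: float = 0.3) -> str:
    """Restate the loop's rule: share = count_male / count_female - 1."""
    share = count_male / count_female - 1
    if share > threshold:
        return "mainly_male"
    if share < -threshold:
        return "mainly_female"
    return "inconclusive"


# Counts copied from the df_names_mixed_sex output above.
print(classify_mixed_name(count_male=19, count_female=223))   # Yury    -> mainly_female
print(classify_mixed_name(count_male=212, count_female=150))  # Aalijah -> mainly_male
print(classify_mixed_name(count_male=2, count_female=2))      # A       -> inconclusive

# The rule is not symmetric, because the ratio is always male / female:
print(classify_mixed_name(count_male=100, count_female=75))   # share = +0.33 -> mainly_male
print(classify_mixed_name(count_male=75, count_female=100))   # share = -0.25 -> inconclusive
```

The last two calls show that a 100:75 split counts as mainly male, while the mirrored 75:100 split stays inconclusive. If a symmetric rule were preferred, one option would be to compare each gender's share of the total count instead, but that would change the set sizes reported below.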


In [101]:
print(mainly_female_name_buffer.isdisjoint(mainly_male_name_buffer))
print(mainly_female_name_buffer.isdisjoint(inconclusive_name_buffer))
print(inconclusive_name_buffer.isdisjoint(mainly_male_name_buffer))

True
True
True


In [102]:
df_names_single_sex['Name'].isin(mainly_female_name_buffer).any()

False

In [103]:
print(f'Number of unambiguously female names: {len(df_female_names)}')
print(f'Number of unambiguously male names: {len(df_male_names)}')
print(f'Number of mixed names classified as female: {len(mainly_female_name_buffer)}')
print(f'Number of mixed names classified as male: {len(mainly_male_name_buffer)}')
print(f'Number of inconclusive mixed names: {len(inconclusive_name_buffer)}')


Number of unambiguously female names: 76390
Number of unambiguously male names: 44161
Number of mixed names classified as female: 6148
Number of mixed names classified as male: 6112
Number of inconclusive mixed names: 1099


## 2. Classification through exact match

As noted in the introduction, gender is personal and, in a proper setting, should be asked of the person. For the coarse approximation used in this research, we match each fragment of an author's name exactly against sets of names derived from the name dataset. The intent is not to make assumptions about any individual's gender identity, but to explore patterns within the dataset.

In [104]:
female_names_set = set(df_female_names["Name"]).union(mainly_female_name_buffer)

In [105]:
male_names_set = set(df_male_names["Name"]).union(mainly_male_name_buffer)

In [106]:
print(female_names_set.isdisjoint(male_names_set))

True


In [107]:
df_books["name_chunks"] = df_books["Book-Author"].str.split()

In [141]:
'PHILIP'.title()

'Philip'

In [144]:
def filter_chunks(chunk_list):
    # Drop single-letter initials such as "A." and title-case the remaining chunks.
    return [
        el.title()       # coerce text case
        for el in chunk_list 
        if not (
            el.endswith(".")
            and len(el) <= 2
        )
    ]

In [151]:
filter_chunks(["A.", "Miss.", "PHILIP", "PHILIP-JANE", "D'ARTANjaN"])

['Miss.', 'Philip', 'Philip-Jane', "D'Artanjan"]
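Some author strings in this dataset keep tokens that the filter above does not remove, for example `M.,` in "Allen M., Ph.D. Young" or `P.J.` in "P.J. O'Rourke". The following is a hedged sketch of one possible stricter variant. The name `filter_chunks_strict` and the regular expression are assumptions for illustration and were not used to produce the results in this notebook.

```python
import re

# One or more "letter + dot" groups, e.g. "A.", "P.J.", "A.S."
INITIALS_RE = re.compile(r"^(?:[A-Za-z]\.)+$")


def filter_chunks_strict(chunk_list: list[str]) -> list[str]:
    cleaned = []
    for el in chunk_list:
        el = el.strip(",;")  # remove stray separators around the token
        if not el or INITIALS_RE.match(el):
            continue  # skip empty tokens and pure initials
        cleaned.append(el.title())  # coerce text case, as in filter_chunks
    return cleaned


print(filter_chunks_strict(["Allen", "M.,", "Ph.D.", "Young"]))  # ['Allen', 'Ph.D.', 'Young']
print(filter_chunks_strict(["P.J.", "O'Rourke"]))                # ["O'Rourke"]
print(filter_chunks_strict(["A.", "Miss.", "PHILIP"]))           # ['Miss.', 'Philip']
```

Titles such as `Ph.D.` and `Miss.` still pass through in this sketch; removing them would need an explicit stop list.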

In [152]:
df_books["name_chunks"] = df_books["name_chunks"].progress_apply(filter_chunks)

100%|██████████████████████████████████████████████████████████████████████| 271379/271379 [00:00<00:00, 296172.51it/s]


In [153]:
df_books["name_chunks"]

0                  [Mark, Morford]
1         [Richard, Bruce, Wright]
2                  [Carlo, D'Este]
3             [Gina, Bari, Kolata]
4                         [Barber]
                    ...           
271374           [Paula, Danziger]
271375               [Teri, Sloat]
271376         [Christine, Wicker]
271377                     [Plato]
271378       [Christopher, Biffle]
Name: name_chunks, Length: 271379, dtype: object

In [154]:
def check_in_set(chunks: list[str], names=female_names_set) -> list[str]:
    return [name for name in chunks if name in names]
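As a quick illustration of the matching step, the sketch below runs the same list comprehension against small toy sets. The sets `toy_female` and `toy_male` are hypothetical stand-ins for `female_names_set` and `male_names_set`, and the function is restated with a required `names` argument so the snippet runs on its own.

```python
def check_in_set(chunks: list[str], names: set[str]) -> list[str]:
    # Keep the chunks that appear in the given name set, preserving their order.
    return [name for name in chunks if name in names]


# Hypothetical toy sets; the real sets are built from the name dataset.
toy_female = {"Gina", "Bari", "Amy"}
toy_male = {"Mark", "Tan"}

for chunks in (["Gina", "Bari", "Kolata"], ["Mark", "Morford"], ["Amy", "Tan"]):
    print(chunks, check_in_set(chunks, toy_female), check_in_set(chunks, toy_male))
```

With these toy sets, "Amy Tan" matches both a female name (`Amy`) and a male name (`Tan`), which is the kind of double match that needs a separate decision rule.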

In [155]:
df_books["matched_female"] = df_books["name_chunks"].progress_apply(check_in_set)

100%|█████████████████████████████████████████████████████████████████████| 271379/271379 [00:00<00:00, 1038080.59it/s]


In [156]:
df_books[df_books["matched_female"].apply(bool)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,"[Gina, Bari, Kolata]","[Gina, Bari]",[],[]
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[]
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,"[Ann, Beattie]",[Ann],[Beattie],[]
12,887841740,The Middle Stories,Sheila Heti,2004,House of Anansi Press,http://images.amazon.com/images/P/0887841740.0...,"[Sheila, Heti]",[Sheila],[],[]
17,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,http://images.amazon.com/images/P/1881320189.0...,"[Julia, Oliver]",[Julia],[Oliver],[]
...,...,...,...,...,...,...,...,...,...,...
271365,395264707,Dreamsnake,Vonda N. McIntyre,1978,Houghton Mifflin,http://images.amazon.com/images/P/0395264707.0...,"[Vonda, Mcintyre]",[Vonda],[],[]
271373,449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,"[Robin, Wright]",[Robin],[Wright],[]
271374,440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,"[Paula, Danziger]",[Paula],[],[]
271375,525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,"[Teri, Sloat]",[Teri],[],[]


In [157]:
df_books["matched_male"] = df_books["name_chunks"].progress_apply(lambda x: check_in_set(x, male_names_set))

100%|█████████████████████████████████████████████████████████████████████| 271379/271379 [00:00<00:00, 1048285.32it/s]


In [158]:
df_books[df_books["matched_male"].apply(bool)]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,"[Mark, Morford]",[],[Mark],[]
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,"[Richard, Bruce, Wright]",[],"[Richard, Bruce, Wright]",[]
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,"[Carlo, D'Este]",[],[Carlo],[]
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,[Barber],[],[Barber],[]
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[]
...,...,...,...,...,...,...,...,...,...,...
271371,1845170423,Cocktail Classics,David Biggs,2004,Connaught,http://images.amazon.com/images/P/1845170423.0...,"[David, Biggs]",[],[David],[]
271372,014002803X,Anti Death League,Kingsley Amis,1975,Viking Press,http://images.amazon.com/images/P/014002803X.0...,"[Kingsley, Amis]",[],"[Kingsley, Amis]",[]
271373,449906736,Flashpoints: Promise and Peril in a New World,Robin Wright,1993,Ballantine Books,http://images.amazon.com/images/P/0449906736.0...,"[Robin, Wright]",[Robin],[Wright],[]
271377,192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,[Plato],[],[Plato],[]


In [159]:
df_books["matched_inconclusive"] = df_books["name_chunks"].progress_apply(lambda x: check_in_set(x, inconclusive_name_buffer))

100%|██████████████████████████████████████████████████████████████████████| 271379/271379 [00:00<00:00, 365365.69it/s]


In [160]:
df_books[df_books["matched_inconclusive"].apply(bool)].head(30)

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
105,067976397X,Corelli's Mandolin : A Novel,LOUIS DE BERNIERES,1995,Vintage,http://images.amazon.com/images/P/067976397X.0...,"[Louis, De, Bernieres]",[],[Louis],[De]
257,60563079,Peter Pan: The Original Story (Peter Pan),J. M. Barrie,2003,HarperFestival,http://images.amazon.com/images/P/0060563079.0...,[Barrie],[],[],[Barrie]
267,312890044,Moonheart (Newford),Charles de Lint,1994,Orb Books,http://images.amazon.com/images/P/0312890044.0...,"[Charles, De, Lint]",[],[Charles],[De]
388,156528207,The Little Prince,Antoine de Saint-ExupГ©ry,1968,Harcourt,http://images.amazon.com/images/P/0156528207.0...,"[Antoine, De, Saint-Exupг©Ry]",[],[Antoine],[De]
417,802114369,Ohitika Woman,Mary Brave Bird,1993,Pub Group West,http://images.amazon.com/images/P/0802114369.0...,"[Mary, Brave, Bird]",[Mary],[Brave],[Bird]
446,60804157,He Understanding Masculine Psychology,Robert A Johnson,1977,Harpercollins Publisher,http://images.amazon.com/images/P/0060804157.0...,"[Robert, A, Johnson]",[],"[Robert, Johnson]",[A]
467,870441663,The wild ponies of Assateague Island (Books fo...,Donna K Grosvenor,1975,National Geographic Society,http://images.amazon.com/images/P/0870441663.0...,"[Donna, K, Grosvenor]",[Donna],[Grosvenor],[K]
489,553285343,"RUSSIA HOUSE, THE",JOHN LE CARRE,1990,Bantam,http://images.amazon.com/images/P/0553285343.0...,"[John, Le, Carre]",[Carre],[John],[Le]
514,26329859,From the Earth: Chinese Vegetarian Cooking,Eileen Yin-Fei Lo,1995,MacMillan Publishing Company.,http://images.amazon.com/images/P/0026329859.0...,"[Eileen, Yin-Fei, Lo]",[Eileen],[],[Lo]
627,440441501,In the Dinosaur's Paw (Kids of the Polk Street...,PATRICIA REILLY GIFF,1987,Yearling,http://images.amazon.com/images/P/0440441501.0...,"[Patricia, Reilly, Giff]",[Patricia],[],[Reilly]


In [161]:
only_female_matches = (df_books['matched_female'].apply(len) > 0 ) & (df_books['matched_male'].apply(len) == 0) 
df_female_books = df_books[only_female_matches].copy()

In [162]:
df_female_books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,"[Gina, Bari, Kolata]","[Gina, Bari]",[],[]
12,887841740,The Middle Stories,Sheila Heti,2004,House of Anansi Press,http://images.amazon.com/images/P/0887841740.0...,"[Sheila, Heti]",[Sheila],[],[]
24,439095026,Tell Me This Isn't Happening,Robynn Clairday,1999,Scholastic,http://images.amazon.com/images/P/0439095026.0...,"[Robynn, Clairday]",[Robynn],[],[]
33,3442353866,Der Fluch der Kaiserin. Ein Richter- Di- Roman.,Eleanor Cooney,2001,Goldmann,http://images.amazon.com/images/P/3442353866.0...,"[Eleanor, Cooney]",[Eleanor],[],[]
38,449005615,Seabiscuit: An American Legend,LAURA HILLENBRAND,2002,Ballantine Books,http://images.amazon.com/images/P/0449005615.0...,"[Laura, Hillenbrand]",[Laura],[],[]
...,...,...,...,...,...,...,...,...,...,...
271364,684860112,Driving to Detroit: Memoirs of a Fast Woman,Lesley Hazleton,1999,Simon &amp; Schuster (Trade Division),http://images.amazon.com/images/P/0684860112.0...,"[Lesley, Hazleton]",[Lesley],[],[]
271365,395264707,Dreamsnake,Vonda N. McIntyre,1978,Houghton Mifflin,http://images.amazon.com/images/P/0395264707.0...,"[Vonda, Mcintyre]",[Vonda],[],[]
271374,440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,"[Paula, Danziger]",[Paula],[],[]
271375,525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,"[Teri, Sloat]",[Teri],[],[]


In [163]:
only_male_matches = (df_books['matched_female'].apply(len) == 0 ) & (df_books['matched_male'].apply(len) > 0) 
df_male_books = df_books[only_male_matches].copy()

In [164]:
df_male_books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,"[Mark, Morford]",[],[Mark],[]
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,"[Richard, Bruce, Wright]",[],"[Richard, Bruce, Wright]",[]
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,"[Carlo, D'Este]",[],[Carlo],[]
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,[Barber],[],[Barber],[]
6,425176428,What If?: The World's Foremost Military Histor...,Robert Cowley,2000,Berkley Publishing Group,http://images.amazon.com/images/P/0425176428.0...,"[Robert, Cowley]",[],[Robert],[]
...,...,...,...,...,...,...,...,...,...,...
271370,1582380805,Tropical Rainforests: 230 Species in Full Colo...,"Allen M., Ph.D. Young",2001,Golden Guides from St. Martin's Press,http://images.amazon.com/images/P/1582380805.0...,"[Allen, M.,, Ph.D., Young]",[],"[Allen, Young]",[]
271371,1845170423,Cocktail Classics,David Biggs,2004,Connaught,http://images.amazon.com/images/P/1845170423.0...,"[David, Biggs]",[],[David],[]
271372,014002803X,Anti Death League,Kingsley Amis,1975,Viking Press,http://images.amazon.com/images/P/014002803X.0...,"[Kingsley, Amis]",[],"[Kingsley, Amis]",[]
271377,192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,[Plato],[],[Plato],[]


In [165]:
unsorted_books = (~df_books['ISBN'].isin(df_female_books['ISBN'])) & (~df_books['ISBN'].isin(df_male_books['ISBN']))
df_unsorted_books = df_books[unsorted_books].copy()

In [169]:
df_unsorted_books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[]
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,"[Ann, Beattie]",[Ann],[Beattie],[]
17,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,http://images.amazon.com/images/P/1881320189.0...,"[Julia, Oliver]",[Julia],[Oliver],[]
19,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994,Plume,http://images.amazon.com/images/P/0452264464.0...,"[Toni, Morrison]",[Toni],[Morrison],[]
21,1841721522,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,2001,Ryland Peters &amp; Small Ltd,http://images.amazon.com/images/P/1841721522.0...,"[Celia, Brooks, Brown]",[Celia],"[Brooks, Brown]",[]
...,...,...,...,...,...,...,...,...,...,...
271347,471915645,Introductory Digital Signal Processing with Co...,Paul A. Lynn,1989,John Wiley &amp; Sons,http://images.amazon.com/images/P/0471915645.0...,"[Paul, Lynn]",[Lynn],[Paul],[]
271349,3320016822,Urteil ohne Prozess: Margot Honecker gegen Oss...,JГ¶rn Kalkbrenner,1990,Dietz,http://images.amazon.com/images/P/3320016822.0...,"[Jг¶Rn, Kalkbrenner]",[],[],[]
271352,3525335423,Das Deutsche Kaiserreich 1871-1918.,Hans-Ulrich Wehler,1994,Vandenhoeck &amp; Ruprecht,http://images.amazon.com/images/P/3525335423.0...,"[Hans-Ulrich, Wehler]",[],[],[]
271355,3893312307,Die Vereinten Nationen: Zwischen Anspruch und ...,GГјnther Unser,1995,Bundeszentrale fГјr Politische Bildung,http://images.amazon.com/images/P/3893312307.0...,"[Gгјnther, Unser]",[],[],[]


In [167]:
df_unsorted_books[(df_unsorted_books['matched_inconclusive'].apply(len) > 0 )]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
257,60563079,Peter Pan: The Original Story (Peter Pan),J. M. Barrie,2003,HarperFestival,http://images.amazon.com/images/P/0060563079.0...,[Barrie],[],[],[Barrie]
417,802114369,Ohitika Woman,Mary Brave Bird,1993,Pub Group West,http://images.amazon.com/images/P/0802114369.0...,"[Mary, Brave, Bird]",[Mary],[Brave],[Bird]
467,870441663,The wild ponies of Assateague Island (Books fo...,Donna K Grosvenor,1975,National Geographic Society,http://images.amazon.com/images/P/0870441663.0...,"[Donna, K, Grosvenor]",[Donna],[Grosvenor],[K]
489,553285343,"RUSSIA HOUSE, THE",JOHN LE CARRE,1990,Bantam,http://images.amazon.com/images/P/0553285343.0...,"[John, Le, Carre]",[Carre],[John],[Le]
1420,037570504X,"Breath, Eyes, Memory",Edwidge Danticat,1998,Vintage Books USA,http://images.amazon.com/images/P/037570504X.0...,"[Edwidge, Danticat]",[],[],[Edwidge]
...,...,...,...,...,...,...,...,...,...,...
270816,1857939093,Peter Pan and Wendy,Barrie,1997,Pavilion Books,http://images.amazon.com/images/P/1857939093.0...,[Barrie],[],[],[Barrie]
270921,2265074217,"Les Experts, tome 1 : Double jeux",Max-Allan Collins,2003,Fleuve noir,http://images.amazon.com/images/P/2265074217.0...,"[Max-Allan, Collins]",[],[],[Collins]
270999,785300821,Treasury of Campbell's Recipes,Campbell Soup Company,1991,Publications International,http://images.amazon.com/images/P/0785300821.0...,"[Campbell, Soup, Company]",[],[],[Campbell]
271107,8401014441,El Jardinero Fiel,John Le Carre,2001,"Plaza &amp; Janes Editores, S.A.",http://images.amazon.com/images/P/8401014441.0...,"[John, Le, Carre]",[Carre],[John],[Le]


In [168]:
54596+135465+81318

271379

In [170]:
55045+138621+77713

271379
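The sums above check that the three groups add up to the full number of books. A minimal sketch of the same check, computed from boolean masks instead of hard-coded numbers, could look like this. The `toy` frame is a hypothetical four-row stand-in for `df_books`, with match lists modelled on rows shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_books with the two match columns.
toy = pd.DataFrame({
    "Book-Author": ["Gina Bari Kolata", "Mark P. O. Morford", "Amy Tan", "Stel Pavlou"],
    "matched_female": [["Gina", "Bari"], [], ["Amy"], []],
    "matched_male": [[], ["Mark"], ["Tan"], []],
})

has_female = toy["matched_female"].apply(len) > 0
has_male = toy["matched_male"].apply(len) > 0

only_female = has_female & ~has_male
only_male = ~has_female & has_male
unsorted_mask = ~(only_female | only_male)  # double matches and no matches

sizes = {
    "female": int(only_female.sum()),
    "male": int(only_male.sum()),
    "unsorted": int(unsorted_mask.sum()),
}
print(sizes)  # {'female': 1, 'male': 1, 'unsorted': 2}

# The three groups must partition the frame.
assert sum(sizes.values()) == len(toy)
```

Deriving the unsorted mask from the other two masks also avoids relying on `ISBN` values being unique.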

## 3. Algorithmic decision function

A visual inspection of 100 entries showed that the author's given name comes before the surname.
This suggests a heuristic for authors whose name fragments match both genders: if the first fragment matched among female names appears before the first fragment matched among male names, we consider the author a woman; otherwise, we consider the author a man.

In [173]:
double_match = (df_unsorted_books['matched_female'].apply(len) > 0 ) & (df_unsorted_books['matched_male'].apply(len) > 0) 
df_double_match = df_unsorted_books[double_match].copy()

In [174]:
df_double_match

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[]
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,"[Ann, Beattie]",[Ann],[Beattie],[]
17,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,http://images.amazon.com/images/P/1881320189.0...,"[Julia, Oliver]",[Julia],[Oliver],[]
19,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994,Plume,http://images.amazon.com/images/P/0452264464.0...,"[Toni, Morrison]",[Toni],[Morrison],[]
21,1841721522,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,2001,Ryland Peters &amp; Small Ltd,http://images.amazon.com/images/P/1841721522.0...,"[Celia, Brooks, Brown]",[Celia],"[Brooks, Brown]",[]
...,...,...,...,...,...,...,...,...,...,...
271340,3453047192,Die amerikanische Zumutung: PlГ¤doyers gegen d...,Rolf Winter,1990,W. Heyne,http://images.amazon.com/images/P/3453047192.0...,"[Rolf, Winter]",[Winter],[Rolf],[]
271341,9813056398,Broken Mirror : True Stories About Drug Abuse,Dawn Tan,2000,Angsana Books,http://images.amazon.com/images/P/9813056398.0...,"[Dawn, Tan]",[Dawn],[Tan],[]
271342,000637610X,You Got an Ology,Maureen Lipman,1990,HarperCollins Publishers,http://images.amazon.com/images/P/000637610X.0...,"[Maureen, Lipman]",[Maureen],[Lipman],[]
271347,471915645,Introductory Digital Signal Processing with Co...,Paul A. Lynn,1989,John Wiley &amp; Sons,http://images.amazon.com/images/P/0471915645.0...,"[Paul, Lynn]",[Lynn],[Paul],[]


In [177]:
def decision_func_female(name_chunks: list[str], matched_female: list[str], matched_male: list[str]) -> bool:
    # Positions of the matched fragments within the author's name.
    female_indices: list[int] = [name_chunks.index(fem_chunk) for fem_chunk in matched_female]
    male_indices: list[int] = [name_chunks.index(male_chunk) for male_chunk in matched_male]
    # Female if the earliest female match precedes the earliest male match.
    return min(female_indices) < min(male_indices)
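As a worked example of the heuristic, the sketch below applies the decision function to three rows from the double-match table above. The function is restated so the snippet runs on its own, and the `examples` list is only an illustrative selection.

```python
def decision_func_female(name_chunks: list[str], matched_female: list[str], matched_male: list[str]) -> bool:
    # Positions of the matched fragments within the author's name.
    female_indices = [name_chunks.index(chunk) for chunk in matched_female]
    male_indices = [name_chunks.index(chunk) for chunk in matched_male]
    # Female if the earliest female match precedes the earliest male match.
    return min(female_indices) < min(male_indices)


# (name_chunks, matched_female, matched_male) copied from rows shown above.
examples = [
    (["Amy", "Tan"], ["Amy"], ["Tan"]),
    (["Rolf", "Winter"], ["Winter"], ["Rolf"]),
    (["Celia", "Brooks", "Brown"], ["Celia"], ["Brooks", "Brown"]),
]

for chunks, fem, male in examples:
    print(" ".join(chunks), "->", decision_func_female(chunks, fem, male))
# Amy Tan -> True
# Rolf Winter -> False
# Celia Brooks Brown -> True
```

Note that `list.index` returns the first occurrence of a chunk, and that both match lists must be non-empty, which holds for the double-match rows.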

In [178]:
buffer: list[bool] = []
for _, row in tqdm(df_double_match.iterrows(), total=len(df_double_match)):
    decided_female = decision_func_female(
        name_chunks=row.name_chunks,
        matched_female=row.matched_female,
        matched_male=row.matched_male,
    )
    buffer.append(decided_female)

df_double_match['decided_female'] = buffer

100%|█████████████████████████████████████████████████████████████████████████| 64211/64211 [00:02<00:00, 21711.66it/s]


In [185]:
df_double_match[df_double_match['decided_female']]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive,decided_female
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[],True
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,"[Ann, Beattie]",[Ann],[Beattie],[],True
17,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,http://images.amazon.com/images/P/1881320189.0...,"[Julia, Oliver]",[Julia],[Oliver],[],True
19,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994,Plume,http://images.amazon.com/images/P/0452264464.0...,"[Toni, Morrison]",[Toni],[Morrison],[],True
21,1841721522,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,2001,Ryland Peters &amp; Small Ltd,http://images.amazon.com/images/P/1841721522.0...,"[Celia, Brooks, Brown]",[Celia],"[Brooks, Brown]",[],True
...,...,...,...,...,...,...,...,...,...,...,...
271317,792703103,The Hawk of Venice (Atlantic Large Print),Sally Wentworth,1990,Chivers North Amer,http://images.amazon.com/images/P/0792703103.0...,"[Sally, Wentworth]",[Sally],[Wentworth],[],True
271326,441166210,"The Dragon Hoard (Magic Quest, #6)",Tanith Lee,1984,Tempo,http://images.amazon.com/images/P/0441166210.0...,"[Tanith, Lee]",[Tanith],[Lee],[],True
271341,9813056398,Broken Mirror : True Stories About Drug Abuse,Dawn Tan,2000,Angsana Books,http://images.amazon.com/images/P/9813056398.0...,"[Dawn, Tan]",[Dawn],[Tan],[],True
271342,000637610X,You Got an Ology,Maureen Lipman,1990,HarperCollins Publishers,http://images.amazon.com/images/P/000637610X.0...,"[Maureen, Lipman]",[Maureen],[Lipman],[],True


In [192]:
df_double_match['decided_male'] = ~df_double_match['decided_female']

In [193]:
df_double_match

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive,decided_female,decided_male
5,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,"[Amy, Tan]",[Amy],[Tan],[],True,False
9,074322678X,Where You'll Find Me: And Other Stories,Ann Beattie,2002,Scribner,http://images.amazon.com/images/P/074322678X.0...,"[Ann, Beattie]",[Ann],[Beattie],[],True,False
17,1881320189,Goodbye to the Buttermilk Sky,Julia Oliver,1994,River City Pub,http://images.amazon.com/images/P/1881320189.0...,"[Julia, Oliver]",[Julia],[Oliver],[],True,False
19,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994,Plume,http://images.amazon.com/images/P/0452264464.0...,"[Toni, Morrison]",[Toni],[Morrison],[],True,False
21,1841721522,New Vegetarian: Bold and Beautiful Recipes for...,Celia Brooks Brown,2001,Ryland Peters &amp; Small Ltd,http://images.amazon.com/images/P/1841721522.0...,"[Celia, Brooks, Brown]",[Celia],"[Brooks, Brown]",[],True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
271340,3453047192,Die amerikanische Zumutung: PlГ¤doyers gegen d...,Rolf Winter,1990,W. Heyne,http://images.amazon.com/images/P/3453047192.0...,"[Rolf, Winter]",[Winter],[Rolf],[],False,True
271341,9813056398,Broken Mirror : True Stories About Drug Abuse,Dawn Tan,2000,Angsana Books,http://images.amazon.com/images/P/9813056398.0...,"[Dawn, Tan]",[Dawn],[Tan],[],True,False
271342,000637610X,You Got an Ology,Maureen Lipman,1990,HarperCollins Publishers,http://images.amazon.com/images/P/000637610X.0...,"[Maureen, Lipman]",[Maureen],[Lipman],[],True,False
271347,471915645,Introductory Digital Signal Processing with Co...,Paul A. Lynn,1989,John Wiley &amp; Sons,http://images.amazon.com/images/P/0471915645.0...,"[Paul, Lynn]",[Lynn],[Paul],[],False,True


In [194]:
df_female_books_detected = df_double_match[df_double_match['decided_female']].copy()
df_male_books_detected = df_double_match[df_double_match['decided_male']].copy()
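If a single frame per gender is needed later, the exact matches and the heuristic decisions could be combined while keeping track of how each row was classified. This is only a sketch under that assumption: the frames `exact` and `heuristic` are hypothetical stand-ins for `df_female_books` and `df_female_books_detected`, and the `source` column is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins with a subset of the real columns.
exact = pd.DataFrame({
    "ISBN": ["374157065", "887841740"],
    "Book-Author": ["Gina Bari Kolata", "Sheila Heti"],
})
heuristic = pd.DataFrame({
    "ISBN": ["399135782"],
    "Book-Author": ["Amy Tan"],
})

# Tag each row with the step that classified it, then stack the frames.
combined = pd.concat(
    [
        exact.assign(source="exact_match"),
        heuristic.assign(source="name_order_heuristic"),
    ],
    ignore_index=True,
)
print(combined)
```

Keeping the `source` column makes it possible to report results separately for the stricter exact matches and the rows decided by name order.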

## 4. Leftovers

Books for which no gender could be detected.

In [196]:
unmatch = (df_unsorted_books['matched_female'].apply(len) == 0 ) & (df_unsorted_books['matched_male'].apply(len) == 0) 
df_unmatch = df_unsorted_books[unmatch].copy()

In [197]:
df_unmatch

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-M,name_chunks,matched_female,matched_male,matched_inconclusive
82,087113375X,Modern Manners: An Etiquette Book for Rude People,P.J. O'Rourke,1990,Atlantic Monthly Press,http://images.amazon.com/images/P/087113375X.0...,"[P.J., O'Rourke]",[],[],[]
84,743403843,Decipher,Stel Pavlou,2002,Simon &amp; Schuster (Trade Division),http://images.amazon.com/images/P/0743403843.0...,"[Stel, Pavlou]",[],[],[]
176,3150000335,Kabale Und Liebe,Schiller,0,"Philipp Reclam, Jun Verlag GmbH",http://images.amazon.com/images/P/3150000335.0...,[Schiller],[],[],[]
257,60563079,Peter Pan: The Original Story (Peter Pan),J. M. Barrie,2003,HarperFestival,http://images.amazon.com/images/P/0060563079.0...,[Barrie],[],[],[Barrie]
279,394586239,Possession: A Romance,A. S. Byatt,1990,Random House Inc,http://images.amazon.com/images/P/0394586239.0...,[Byatt],[],[],[]
...,...,...,...,...,...,...,...,...,...,...
271251,721402712,Little Red Riding Hood (Well Loved Tales),Ladybird Series,1982,Ladybird Books,http://images.amazon.com/images/P/0721402712.0...,"[Ladybird, Series]",[],[],[]
271337,375415343,A Whistling Woman,A.S. BYATT,2002,Knopf,http://images.amazon.com/images/P/0375415343.0...,"[A.S., Byatt]",[],[],[]
271349,3320016822,Urteil ohne Prozess: Margot Honecker gegen Oss...,JГ¶rn Kalkbrenner,1990,Dietz,http://images.amazon.com/images/P/3320016822.0...,"[Jг¶Rn, Kalkbrenner]",[],[],[]
271352,3525335423,Das Deutsche Kaiserreich 1871-1918.,Hans-Ulrich Wehler,1994,Vandenhoeck &amp; Ruprecht,http://images.amazon.com/images/P/3525335423.0...,"[Hans-Ulrich, Wehler]",[],[],[]


In [198]:
13502/271379*100

4.975329704951378