Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix bug during w2v training with utf8 characters #76

Merged
merged 7 commits into from
Nov 22, 2023

Conversation

pakhy2380
Copy link
Contributor

bug

when training w2v with Korean words(utf-8 characters), idmap['cols'] couldn't get utf-8

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

as-is

  • dtype=Sx
    • S: data type should be ascii characters

to be

  • dtype=h5py.string_dtype('utf-8')

@CLAassistant
Copy link

CLAassistant commented Sep 21, 2023

CLA assistant check
All committers have signed the CLA.

@chiwanpark
Copy link
Member

@pakhy2380 Please fix lint error (do not use single quote), and could you add a test case for this patch?

@pakhy2380
Copy link
Contributor Author

@chiwanpark
I have resolved the lint error and included a test case for Korean characters.

Is there any case that uid file is in unicode, not ascii? If so, the uid file would need to be modified, too.

@chiwanpark
Copy link
Member

@pakhy2380 Is there additional memory overhead when we use UTF-8 dtype? Then, how the size of memory overhead is?

If the memory overhead is negligible, please apply the dtype to uid. Otherwise, keep the current implementation for uid because the number of user IDs could be very large in general.

@chiwanpark chiwanpark self-requested a review November 3, 2023 08:48
@chiwanpark
Copy link
Member

@pakhy2380 BTW, maybe you need to clean the commit history (or do sign the our CLA). Currently, this PR contains the commits from multiple authors @pakhy2380 and @hugh-ga, and the last author haven't agreed to CLA.

@pakhy2380
Copy link
Contributor Author

@chiwanpark
I tested the new code and there was no increase in memory usage.
Here are the test results.

I will apply the same data type to uid as well.

original code

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    48     96.0 MiB     96.0 MiB           1       @profile
    49                                             def test00_text8(self):
    50    637.5 MiB    541.5 MiB           2           w = w2v.W2V(
    51     96.0 MiB      0.0 MiB           1               self.train_fname,
    52     96.0 MiB      0.0 MiB           1               D=200,
    53     96.0 MiB      0.0 MiB           1               iter=5,  # original: 15
    54     96.0 MiB      0.0 MiB           1               window=5,
    55     96.0 MiB      0.0 MiB           1               alpha=0.025,
    56     96.0 MiB      0.0 MiB           1               num_proc=5,  # original: 12
    57     96.0 MiB      0.0 MiB           1               negative=5,
    58     96.0 MiB      0.0 MiB           1               min_count=4,
    59     96.0 MiB      0.0 MiB           1               sample=0.001,
    60     96.0 MiB      0.0 MiB           1               preload=1,
    61     96.0 MiB      0.0 MiB           1               batch_size=10000,
    62                                                 )
    63    627.3 MiB    -10.2 MiB           1           w.train()
    64    671.3 MiB     44.0 MiB           1           w.save('wv')

new code (only iid)

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    48     95.2 MiB     95.2 MiB           1       @profile
    49                                             def test00_text8(self):
    50    633.9 MiB    538.7 MiB           2           w = w2v.W2V(
    51     95.2 MiB      0.0 MiB           1               self.train_fname,
    52     95.2 MiB      0.0 MiB           1               D=200,
    53     95.2 MiB      0.0 MiB           1               iter=5,  # original: 15
    54     95.2 MiB      0.0 MiB           1               window=5,
    55     95.2 MiB      0.0 MiB           1               alpha=0.025,
    56     95.2 MiB      0.0 MiB           1               num_proc=5,  # original: 12
    57     95.2 MiB      0.0 MiB           1               negative=5,
    58     95.2 MiB      0.0 MiB           1               min_count=4,
    59     95.2 MiB      0.0 MiB           1               sample=0.001,
    60     95.2 MiB      0.0 MiB           1               preload=1,
    61     95.2 MiB      0.0 MiB           1               batch_size=10000,
    62                                                 )
    63    627.0 MiB     -7.0 MiB           1           w.train()
    64    670.6 MiB     43.6 MiB           1           w.save('wv')

new code (including uid)

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    48     95.2 MiB     95.2 MiB           1       @profile
    49                                             def test00_text8(self):
    50    630.7 MiB    535.4 MiB           2           w = w2v.W2V(
    51     95.2 MiB      0.0 MiB           1               self.train_fname,
    52     95.2 MiB      0.0 MiB           1               D=200,
    53     95.2 MiB      0.0 MiB           1               iter=5,  # original: 15
    54     95.2 MiB      0.0 MiB           1               window=5,
    55     95.2 MiB      0.0 MiB           1               alpha=0.025,
    56     95.2 MiB      0.0 MiB           1               num_proc=5,  # original: 12
    57     95.2 MiB      0.0 MiB           1               negative=5,
    58     95.2 MiB      0.0 MiB           1               min_count=4,
    59     95.2 MiB      0.0 MiB           1               sample=0.001,
    60     95.2 MiB      0.0 MiB           1               preload=1,
    61     95.2 MiB      0.0 MiB           1               batch_size=10000,
    62                                                 )
    63    624.3 MiB     -6.4 MiB           1           w.train()
    64    668.5 MiB     44.2 MiB           1           w.save('wv')

test case info

[INFO    ] 2023-11-06 12:06:04 [w2v.py:44] Stream Header(17006, 253854, 16988201) Validation(17006 samples)
[INFO    ] 2023-11-06 12:06:04 [w2v.py:93] Caculating the frequency of words.
[INFO    ] 2023-11-06 12:06:48 [w2v.py:101] Reducing vocab with min_count(4).
[INFO    ] 2023-11-06 12:06:51 [w2v.py:109] Scaling vocab(82251) with sample(0.001).
[INFO    ] 2023-11-06 12:06:54 [w2v.py:124] Downsampled 38 most-common words.
[INFO    ] 2023-11-06 12:06:55 [w2v.py:133] Vocab(82251) TotalWords(16988201)

sample (3 rows)

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist society might work vary considerably especially with respect to economics there is also disagreement about how a free society might be brought about origins and predecessors kropotkin and others argue that before recorded history human society was organized on anarchist principles most anthropologists follow kropotkin and engels in believing that hunter gatherer bands were egalitarian and lacked division of labour accumulated wealth or decreed law and had equal access to resources william godwin anarchists including the the anarchy organisation and rothbard find anarchist attitudes in taoism from ancient china kropotkin found similar ideas in stoic zeno of citium according to kropotkin zeno repudiated the omnipotence of the state its intervention and regimentation and proclaimed the sovereignty of the moral law of the individual the anabaptists of one six th century europe are sometimes considered to be religious forerunners of modern anarchism bertrand russell in his history of western philosophy writes that the anabaptists repudiated all law since they held that the good man will be guided at every moment by the holy spirit from this premise they arrive at communism the diggers or true levellers were an early communistic movement during the time of the english civil war and are considered by some as forerunners of modern anarchism in the modern era the first to use the term to mean something other than chaos was louis armand baron de lahontan in his nouveaux voyages dans l am rique septentrionale one seven zero three where he described the indigenous american society which had no state laws prisons priests or private property as being in anarchy russell means a libertarian and leader in the american indian movement has repeatedly stated that he is an anarchist and so are all his ancestors in one seven nine three in the thick of the french revolution william godwin published an enquiry concerning political justice although godwin did not use the word anarchism many later anarchists have regarded this book as the first major anarchist text and godwin as the founder of philosophical anarchism but at this point no anarchist movement yet existed and the term anarchiste was known mainly as an insult hurled by the bourgeois girondins at more radical elements in the french revolution the first self labelled anarchist pierre joseph proudhon it is commonly held that it wasn t until pierre joseph proudhon published what is property in one eight four zero that the term anarchist was adopted as a self description it is for this reason that some claim proudhon as the founder of modern anarchist theory in what is property proudhon answers with the famous accusation property is theft in this work he opposed the institution of decreed property propri t where owners have complete rights to use and abuse their property as they wish such as exploiting workers for profit in its place proudhon supported what he called possession individuals can have limited rights to use resources capital and goods in accordance with principles of equality and justice proudhon s vision of anarchy which he called mutualism mutuellisme involved an exchange economy where individuals and groups could trade the products of their labor using labor notes which represented the amount of working time involved in production this would ensure that no one would profit from the labor of others workers could freely join together in co operative workshops an interest free bank would be set up to provide everyone with access to the means of production proudhon s ideas were influential within french working class movements and his followers were active in the revolution of one eight four eight in france proudhon s philosophy of property is complex it was developed in a number of works over his lifetime and there are differing interpretations of some of his ideas for more detailed discussion see here max stirner s egoism in his the ego and its own stirner argued that most commonly accepted social institutions including the notion of state property as a right natural rights in general and the very notion of society were mere illusions or ghosts in the mind saying of society that the individuals are its reality he advocated egoism and a form of amoralism in which individuals would unite in associations of egoists only when it was in their self interest to do so for him property simply comes about through might whoever knows how to take to defend the thing to him belongs property and what i have in my power that is my own so long as i assert myself as holder i am the proprietor of the thing stirner never called himself an anarchist he accepted only the label egoist nevertheless his ideas were influential on many individualistically inclined anarchists although interpretations of his thought are diverse
american individualist anarchism benjamin tucker in one eight two five josiah warren had participated in a communitarian experiment headed by robert owen called new harmony which failed in a few years amidst much internal conflict warren blamed the community s failure on a lack of individual sovereignty and a lack of private property warren proceeded to organise experimenal anarchist communities which respected what he called the sovereignty of the individual at utopia and modern times in one eight three three warren wrote and published the peaceful revolutionist which some have noted to be the first anarchist periodical ever published benjamin tucker says that warren was the first man to expound and formulate the doctrine now known as anarchism liberty xiv december one nine zero zero one benjamin tucker became interested in anarchism through meeting josiah warren and william b greene he edited and published liberty from august one eight eight one to april one nine zero eight it is widely considered to be the finest individualist anarchist periodical ever issued in the english language tucker s conception of individualist anarchism incorporated the ideas of a variety of theorists greene s ideas on mutual banking warren s ideas on cost as the limit of price a heterodox variety of labour theory of value proudhon s market anarchism max stirner s egoism and herbert spencer s law of equal freedom tucker strongly supported the individual s right to own the product of his or her labour as private property and believed in a market economy for trading this property he argued that in a truly free market system without the state the abundance of competition would eliminate profits and ensure that all workers received the full value of their labor other one nine th century individualists included lysander spooner stephen pearl andrews and victor yarros the first international mikhail bakunin one eight one four one eight seven six in europe harsh reaction followed the revolutions of one eight four eight twenty years later in one eight six four the international workingmen s association sometimes called the first international united some diverse european revolutionary currents including anarchism due to its genuine links to active workers movements the international became signficiant from the start karl marx was a leading figure in the international he was elected to every succeeding general council of the association the first objections to marx came from the mutualists who opposed communism and statism shortly after mikhail bakunin and his followers joined in one eight six eight the first international became polarised into two camps with marx and bakunin as their respective figureheads the clearest difference between the camps was over strategy the anarchists around bakunin favoured in kropotkin s words direct economical struggle against capitalism without interfering in the political parliamentary agitation at that time marx and his followers focused on parliamentary activity bakunin characterised marx s ideas as authoritarian and predicted that if a marxist party gained to power its leaders would end up as bad as the ruling class they had fought against in one eight seven two the conflict climaxed with a final split between the two groups at the hague congress this is often cited as the origin of the conflict between anarchists and marxists from this moment the social democratic and libertarian currents of socialism had distinct organisations including rival internationals anarchist communism peter kropotkin proudhon and bakunin both opposed communism associating it with statism however in the one eight seven zero s many anarchists moved away from bakunin s economic thinking called collectivism and embraced communist concepts communists believed the means of production should be owned collectively and that goods be distributed by need not labor an early anarchist communist was joseph d jacque the first person to describe himself as libertarian unlike proudhon he argued that it is not the product of his or her labor that the worker has a right to but to the satisfaction of his or her needs whatever may be their nature he announced his ideas in his us published journal le libertaire one eight five eight one eight six one peter kropotkin often seen as the most important theorist outlined his economic ideas in the conquest of bread and fields factories and workshops he felt co operation is more beneficial than competition illustrated in nature in mutual aid a factor of evolution one eight nine seven subsequent anarchist communists include emma goldman and alexander berkman many in the anarcho syndicalist movements see below saw anarchist communism as their objective isaac puente s one nine three two comunismo libertario was adopted by the spanish cnt as its manifesto for a post revolutionary society some anarchists disliked merging communism with anarchism several individualist anarchists maintained that abolition of private property was not consistent with liberty for example benjamin tucker whilst professing respect for kropotkin and publishing his work described communist anarchism as pseudo anarchism propaganda of the deed johann most was an outspoken advocate of violence anarchists have often been portrayed as dangerous and violent due mainly to a number of high profile violent acts including riots assassinations insurrections and terrorism by some anarchists some revolutionaries of the late one nine th century encouraged acts of political violence such as bombings and the assassinations of heads of state to further anarchism such actions have sometimes been called propaganda by the deed one of the more outspoken advocates of this strategy was johann most who said the existing system will be quickest and most radically overthrown by the annihilation of its exponents therefore massacres of the enemies of the people must be set in motion most s preferred method of terrorism dynamite earned him the moniker dynamost however there is no consensus on the legitimacy or utility of violence in general mikhail bakunin and errico malatesta for example wrote of violence as a necessary and sometimes desirable force in revolutionary settings but at the same time they denounced acts of individual terrorism malatesta in on violence and bakunin when
he refuted nechaev other anarchists sometimes identified as pacifist anarchists advocated complete nonviolence leo tolstoy whose philosophy is often viewed as a form of christian anarchism see below was a notable exponent of nonviolent resistance anarchism in the labour movement the red and black flag coming from the experience of anarchists in the labour movement is particularly associated with anarcho syndicalism anarcho syndicalism was an early two zero th century working class movement seeking to overthrow capitalism and the state to institute a worker controlled society the movement pursued industrial actions such as general strike as a primary strategy many anarcho syndicalists believed in anarchist communism though not all communists believed in syndicalism after the one eight seven one repression french anarchism reemerged influencing the bourses de travails of autonomous workers groups and trade unions from this movement the conf d ration g n rale du travail general confederation of work cgt was formed in one eight nine five as the first major anarcho syndicalist movement emile pataud and emile pouget s writing for the cgt saw libertarian communism developing from a general strike after one nine one four the cgt moved away from anarcho syndicalism due to the appeal of bolshevism french style syndicalism was a significant movement in europe prior to one nine two one and remained a significant movement in spain until the mid one nine four zero s the industrial workers of the world iww founded in one nine zero five in the us espoused unionism and sought a general strike to usher in a stateless society in one nine two three one zero zero zero zero zero members existed with the support of up to three zero zero zero zero zero though not explicitly anarchist they organized by rank and file democracy embodying a spirit of resistance that has inspired many anglophone syndicalists cnt propaganda from april two zero zero four reads don t let the politicians rule our lives you vote and they decide don t allow it unity action self management spanish anarchist trade union federations were formed in the one eight seven zero s one nine zero zero and one nine one zero the most successful was the confederaci n nacional del trabajo national confederation of labour cnt founded in one nine one zero prior to the one nine four zero s the cnt was the major force in spanish working class politics with a membership of one five eight million in one nine three four the cnt played a major role in the spanish civil war see also anarchism in spain syndicalists like ricardo flores mag n were key figures in the mexican revolution latin american anarchism was strongly influenced extending to the zapatista rebellion and the factory occupation movements in argentina in berlin in one nine two two the cnt was joined with the international workers association an anarcho syndicalist successor to the first international contemporary anarcho syndicalism continues as a minor force in many socities much smaller than in the one nine one zero s two zero s and three zero s the largest organised anarchist movement today is in spain in the form of the confederaci n general del trabajo and the cnt the cgt claims a paid up membership of six zero zero zero zero and received over a million votes in spanish syndical elections other active syndicalist movements include the us workers solidarity alliance and the uk solidarity federation the revolutionary industrial unionist industrial workers of the world also exists claiming two zero zero zero paid members contemporary critics of anarcho syndicalism and revolutionary industrial unionism claim that they are workerist and fail to deal with economic life outside work post leftist critics such as bob black claim anarcho syndicalism advocates oppressive social structures such as work and the workplace anarcho syndicalists in general uphold principles of workers solidarity direct action and self management the russian revolution the russian revolution of one nine one seven was a seismic event in the development of anarchism as a movement and as a philosophy anarchists participated alongside the bolsheviks in both february and october revolutions many anarchists initially supporting the bolshevik coup however the bolsheviks soon turned against the anarchists and other left wing opposition a conflict which culminated in the one nine one eight kronstadt rebellion anarchists in central russia were imprisoned or driven underground or joined the victorious bolsheviks in ukraine anarchists fought in the civil war against both whites and bolsheviks within the makhnovshchina peasant army led by nestor makhno expelled american anarchists emma goldman and alexander berkman before leaving russia were amongst those agitating in response to bolshevik policy and the suppression of the kronstadt uprising both wrote classic accounts of their experiences in russia aiming to expose the reality of bolshevik control for them bakunin s predictions about the consequences of marxist rule had proved all too true the victory of the bolsheviks in the october revolution and the resulting russian civil war did serious damage to anarchist movements internationally many workers and activists saw bolshevik success as setting an example communist parties grew at the expense of anarchism and other socialist movements in france and the us for example the major syndicalist movements of the cgt and iww began to realign themselves away from anarchism and towards the communist international in paris the dielo truda group of russian anarchist exiles which included nestor makhno concluded that anarchists needed to develop new forms of organisation in response to the structures of bolshevism their one nine two six manifesto known as the organisational platform of the libertarian communists was supported by some communist anarchists though opposed by many others the platform continues to inspire some contemporary anarchist groups who believe in an anarchist movement organised around its principles of theoretical unity tactical unity collective responsibility and federalism platformist groups today include the workers solidarity movement in ireland the uk s anarchist federation and the late north eastern federation of anarchist communists in the northeastern united

@chiwanpark
Copy link
Member

@pakhy2380 There is a lint error for unused packages (import numpy as np). Please fix it.

Copy link
Member

@chiwanpark chiwanpark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pakhy2380 LGTM, I'm merging...

@chiwanpark chiwanpark merged commit 3ddd9bf into kakao:dev Nov 22, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants